Abstract — We introduce a universal quantization scheme based on random coding, and we 
analyze its performance. This scheme consists of a source-independent random codebook (typically 
mismatched to the source distribution), followed by optimal entropy-coding that is matched to the 
quantized codeword distribution. A single-letter formula is derived for the rate achieved by this scheme 
at a given distortion, in the limit of large codebook dimension. The rate reduction due to entropy- 
coding is quantified, and it is shown that it can be arbitrarily large. In the special case of "almost 
uniform" codebooks (e.g., an i.i.d. Gaussian codebook with large variance) and difference distortion 
measures, a novel connection is drawn between the compression achieved by the present scheme and 
the performance of "universal" entropy-coded dithered lattice quantizers. This connection generalizes 
the "half-a-bit" bound on the redundancy of dithered lattice quantizers. Moreover, it demonstrates a 
strong notion of universality where a single "almost uniform" codebook is near-optimal for any source 
and any difference distortion measure. The proofs are based on the fact that the limiting empirical 
distribution of the first matching codeword in a random codebook can be precisely identified. This 
is done using elaborate large deviations techniques, that allow the derivation of a new "almost sure" 
version of the conditional limit theorem. 

Index Terms — Rate-distortion theory, random coding, mismatch, universal quantization, 
universal Gaussian codebook, pattern-matching, large deviations, data compression, robustness. 
1 Introduction 



1.1 Mismatched Quantization and Compression 

Variable-rate lossless compression, or entropy-coding, is an efficient method for enhancing the compression 
performance of quantizers |121 Hj. This paper investigates the role of entropy-coding when 
the quantizer codebook is mismatched with respect to the source distribution. Our motivation mainly 
comes from Ziv's concept of universal quantization for lossy compression of real-valued sources with 
unknown statistics. Ziv's scheme uses a randomized ("dithered") lattice quantizer, which is scaled 
to meet the target distortion level, and the quantizer is followed by a universal lossless encoder which 
reduces the coding rate to the true entropy of the quantized sequence. Neuhoff [2H1 suggested that 
the universal quantizer could be viewed as an efficient combination of a "simple" robust quantizer 
and a "complex" lossless encoder. Variations on the problem of entropy-coded dithered quantization 
(ECDQ) can be found in [H 01 |33] . 

Intuitively, the quantizer mismatch leaves much room for rate savings using entropy-coding. Moreover, 
unlike optimum entropy-constrained vector quantization (ECVQ) [22], the entropy-coding 
gain in the mismatched case does not vanish even in the limit of large vector dimension. For a rather 
trivial example, note that the un-coded rate of an unbounded lattice quantizer is infinite, but it be- 
comes finite after entropy-coding if the source has finite variance. The advantage of entropy-coding 
a mismatched quantizer is particularly prominent at high resolution quantization conditions. Gray 
and Linder show that optimum high-rate performance for mean-squared distortion can be achieved 
even if the quantizer codebook is mismatched with respect to the source (specifically, if it is designed 
for a source with uniform density), as long as the quantizer output is entropy-coded according to the 
true quantizer output distribution \1'A\ sec. VII]. As we shall see here, similar behavior occurs at any 
resolution, only with a slight rate loss due to the codebook mismatch. 

One of the central results of universal quantization theory is that, after entropy-coding, the rate 
loss of the universal quantizer with respect to the optimum ECVQ is bounded for all sources and all 
distortion levels by a universal constant [37], [32]. For example, for squared error distortion, the rate 
loss of a k-dimensional lattice ECDQ is bounded by $\frac{1}{2}\log(4\pi e G_k)$ bits, where $G_k$ is the normalized 
second moment of the lattice; this bound is $\approx 0.754$ bits for k = 1, and it converges to 1/2 bit as 
$k \to \infty$ (where $\log = \log_2$). These results are limited, however, to lattice structured quantizers, and 
more specifically to those lattice dimensions and distortion measures which are covered by lattice 
coding theory. 
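
As a quick numerical illustration of the bound quoted above, the following sketch evaluates $\frac{1}{2}\log_2(4\pi e G_k)$ at two standard values of the normalized second moment. The function name `ecdq_rate_loss_bound` is hypothetical, and the values $G_1 = 1/12$ (scalar integer lattice) and $G_k \to 1/(2\pi e)$ (large-dimension limit) are standard facts assumed here rather than taken from the text.

```python
import math

def ecdq_rate_loss_bound(G_k: float) -> float:
    """Upper bound (1/2) * log2(4 * pi * e * G_k), in bits, as quoted in the text."""
    return 0.5 * math.log2(4.0 * math.pi * math.e * G_k)

# Assumption: normalized second moment of the scalar (k = 1) integer lattice.
G_1 = 1.0 / 12.0
# Assumption: limiting normalized second moment of the best lattices as k grows.
G_inf = 1.0 / (2.0 * math.pi * math.e)

bound_scalar = ecdq_rate_loss_bound(G_1)
bound_limit = ecdq_rate_loss_bound(G_inf)
print(f"k = 1      : {bound_scalar:.3f} bits")
print(f"k -> inf   : {bound_limit:.3f} bits")
```

The two printed values reproduce the 0.754-bit and half-a-bit figures mentioned in the paragraph.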

The central goal of this paper is to develop a structure-free framework for mismatched, entropy- 
coded quantization at an arbitrary distortion level, based on random coding ideas and techniques. The 
random coding framework, although not constructive, allows us to precisely quantify two important 
operational quantities: (a) The potential rate gain due to entropy-coding when using a mismatched 
random codebook; equivalently, this can be thought of as the rate loss of the straightforward scheme 
which uses a mismatched codebook without entropy-coding. (b) The rate loss due to quantizer mismatch, 
over the optimal rate-distortion function: We will derive a universal upper bound for this rate 
loss, analogous to the half-a-bit bound for lattice ECDQs and quadratic distortion described above. 

Mismatched random codebooks for fixed-rate lossy source coding have been investigated by Sakrison 
[HilHZ!) Zhang and Wei jSH!) Lapidoth j5U], Zamir and Rose [SI], and others. See j^l EE] and the 
references therein. Specifically, source coding with a mismatched random codebook (or string matching 
with a mismatched database) has been considered by Steinberg and Gutman [28], Yang and Kieffer [29], 
and Dembo and Kontoyiannis [8], among others. These works (and the references therein) develop an 
extensive theory of mismatched random lossy coding in the limit of large codebook dimension, with 
an emphasis on precisely characterizing the asymptotic rate and the redundancy of these schemes. 
Here we continue that investigation, but we introduce the additional step of entropy-coding the index 
of the codebook before transmitting it to the decoder. 

Entropy coding the codeword index in a source-matched random lossy codebook has been considered 
in the early work by Pinkston [24]. Mismatched high resolution quantization has been considered 
by Bucklew [3] and by Gray and Linder, where several results, some of which parallel those derived 
here, are presented. Preliminary results on the entropy rates achieved by mismatched random codebooks 
for general (non vanishing) distortions and discrete memoryless sources appear in [31]. Here we 
strengthen these results, and extend them to richer classes of sources and codebook distributions. In 
particular, we establish a formal connection between ECDQ and entropy-coded random codebooks. 

1.2 Discussion of Main Results 

We begin in Section 2, where we derive asymptotic single-letter characterizations for the compression 
rate achieved by two different coding schemes, both based on a random codebook $C_n = \{Y_1^n(i),\ i = 1, 2, \ldots\}$ 
consisting of i.i.d. n-dimensional words $Y_1^n(i)$, each having i.i.d. components generated by an arbitrary 
distribution Q. We shall discuss later specific interesting choices for the codebook distribution 
Q. A natural motivation for the use of a mismatched Q is the observation that, in many important 
applications, the source statistics are generally unknown a priori or they change with time - or both. 

Given a source string $X_1^n$ to be compressed with distortion D or less, we consider the index $N_n$ of 
the first codeword in $C_n$ that matches $X_1^n$ within distortion D. Our first result says that as $n \to \infty$, 
the empirical distribution of this first matching word converges to a distribution $Q^*_{P,Q,D}$ which can 
be identified as the solution of a single-letter minimization problem. [Here $P = P_1$ denotes the 
first-order marginal of the source distribution.] The proof is based on large deviations techniques, and 
generalizes the "favorite type theorem" of [34]. 

Using this result we establish an upper bound on the rate achieved when such a random codebook 
is used in conjunction with entropy-coding: Suppose that the encoder first finds the first D-close 
match at position $N_n$, and then entropy-codes the index $N_n$ conditional on the codebook $C_n$. The 
rate achieved for this D-accurate description of $X_1^n$ is 

$$H(N|C) = \lim_{n\to\infty} \frac{1}{n} H(N_n \mid C_n) \quad \text{bits/symbol.}$$

We then compare our bound with the limiting rate R(P, Q, D) achieved in the "naive coding" scenario, 
where the encoder simply transmits the index $N_n$ using Elias' code for the integers, 

$$R(P,Q,D) = \lim_{n\to\infty} \frac{1}{n} \log(N_n) \quad \text{bits/symbol.}$$

We show that the rate gain of entropy-coding over the naive coding scheme (or, equivalently, the rate 
loss of naive coding) satisfies 

$$\text{rate gain} = R(P,Q,D) - H(N|C) \ge H(Q^*_{P,Q,D}\|Q) \quad \text{bits/symbol,}$$

i.e., it is at least as large as the relative entropy between the limiting empirical distribution of $Y_1^n(N_n)$ 
and the codebook-generating distribution Q. For example, it is approximately $H(P\|Q)$ for small 
mean squared distortion D. This lower bound is strictly positive, unless Q is the optimal reproduction 
distribution (i.e., the optimal output distribution of the rate-distortion function). 

This expression resembles the rate loss due to mismatch in the lossless component of the code 
at high resolution quantization (see, e.g., Indeed, for small mean-squared distortion we have 

$H(N|C) \approx h(P) - \frac{1}{2}\log(2\pi e D)$, and $R(P,Q,D) \approx h(P) - \frac{1}{2}\log(2\pi e D) + H(P\|Q)$, hence the latter 
amounts to encoding a source ~ P using a code designed for a source ~ Q. At non-high resolution 
conditions, however, the rate loss remains positive even if the lossless component of the code is matched, 
i.e., H(N\C) is in general strictly above the rate-distortion function. 

Of particular interest is the case of universal Gaussian codebooks: Suppose we encode a real-valued 
memoryless source using a white Gaussian codebook $\sim N(0, \tau^2)$, with respect to squared error 
distortion. If we simply use this codebook in the "naive" sense described above, robust source coding 
theory implies that taking $\tau^2 = \sigma^2 - D$, where $\sigma^2$ denotes the source variance, guarantees achieving 
the Gaussian rate-distortion function $R(D) = \frac{1}{2}\log(\sigma^2/D)$ for any source [213 ED]. Since the Gaussian 
source is the hardest to compress in this class, this implies high redundancy when the source is far 
from Gaussian. 

On the other hand, if we also allow the encoder to entropy code the index, the results are fundamentally 
different. If we take the codebook variance $\tau^2$ to be large, the codebook distribution becomes 
flat and it is tempting to think that the codebook itself looks approximately like the codebook of a 
lattice quantizer. Indeed, we show that as $\tau^2 \to \infty$ the rate $H(N|C)$ achieved by entropy-coding 
this Gaussian codebook is no greater than the rate of a dithered lattice quantizer with large lattice 
dimension, given by 

$$\limsup_{\tau^2\to\infty} H(N|C) \le I(X; X + Z_D) \quad \text{bits/symbol,}$$

where $I(X; X + Z_D)$ denotes the mutual information between the first source symbol X and $X + Z_D$, 
where $Z_D$ is an independent $N(0, D)$ random variable. Combining this with well-known facts about 
universal quantizers [33], [32], it follows that the naive coding rate goes to infinity as $\tau^2 \to \infty$, 
whereas the limiting rate achieved by entropy-coding, $I(X; X + Z_D)$, is at most 1/2 bit above the 
rate-distortion function of X, and it coincides with the rate-distortion function of X in the limit of 
small D. This new derivation provides an interesting bridge between universal quantization theory 
and mismatched random coding. 
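
To make the half-a-bit statement concrete, here is a minimal sketch for the special case of a memoryless Gaussian source, which is an assumption made only for this illustration. In that case both quantities have standard closed forms: the rate-distortion function is $\frac{1}{2}\log_2(\sigma^2/D)$ for $D < \sigma^2$, and the mutual information across an additive $N(0, D)$ noise channel is $\frac{1}{2}\log_2(1 + \sigma^2/D)$. The function names are hypothetical.

```python
import math

def gaussian_rd(sigma2: float, D: float) -> float:
    """Rate-distortion function of an i.i.d. N(0, sigma2) source, in bits."""
    return 0.5 * math.log2(sigma2 / D) if D < sigma2 else 0.0

def gaussian_noise_channel_mi(sigma2: float, D: float) -> float:
    """I(X; X + Z_D) for X ~ N(0, sigma2) and independent Z_D ~ N(0, D), in bits."""
    return 0.5 * math.log2(1.0 + sigma2 / D)

sigma2 = 1.0
gaps = {}
for D in (1e-4, 1e-2, 0.25, 1.0):
    gaps[D] = gaussian_noise_channel_mi(sigma2, D) - gaussian_rd(sigma2, D)
    print(f"D = {D:<6g}  I(X;X+Z_D) - R(D) = {gaps[D]:.5f} bits")
```

For this source the gap equals $\frac{1}{2}\log_2(1 + D/\sigma^2)$, so it vanishes as D becomes small and reaches 1/2 bit at $D = \sigma^2$, in line with the two claims in the paragraph above.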
As observed by Yang and Kieffer [29], the naive coding rate $R(P,Q,D)$ depends only on the 
first-order marginal of the source distribution, hence it does not benefit from memory in the source. 
Moreover, as shown in detail in the sequel, the entire dependence of the naive coding rate on the 
source distribution P may be very weak (in fact, sometimes it is entirely independent of P), thus it 
is far from being optimal. As we shall see, these disadvantages are eliminated by the use of entropy 
coding. 

In particular, for the case of memoryless Gaussian codebooks with large variance, we argue that 
the entropy-coding scheme achieves a rate no greater than the mutual information rate $I(X; X + Z_D)$, 
where $Z_D$ is an independent white Gaussian process with variance D. As before, this in turn implies 
that the rate of the entropy-coded scheme is no greater than R(D) + 1/2 bits/symbol, where R(D) is 
the rate distortion function of the entire source (not just the first-order rate-distortion function). 

We also show that these results generalize beyond the Gaussian codebook case to a much wider 
class of codebooks, namely, "approximately flat" codebooks with distributions of exponential type, 
and to general difference distortion measures. We quantify the entropy-coding gain in this case, and 
show that the resulting compression rate is bounded above by $R(D) + C^*$ bits/symbol, where R(D) 
is the rate-distortion function of the source, and C* is an upper bound for the "min-max capacity" 
defined in j3H]- This is a new generalization of the well-known half-a-bit bound derived for dithered 
lattice quantizers and squared distortion [32] to the case of general difference distortion measures. 
Moreover, it implies the existence of a single ensemble of codebooks which is universal with respect to 
both the source distribution and the distortion criterion, resembling the results of Yang and Kieffer in 

The paper is organized as follows. In Section 2 we describe in detail the entropy-coding scenario 
and the naive coding scheme based on a mismatched random codebook, and we state the main result on 
the index entropy in Theorem 1. Section 3 contains two important examples illustrating the entropy-coding 
gain, including the case of universal Gaussian codebooks mentioned above. In Section 4 we 
state and prove an almost sure conditional limit theorem (Theorem 3), which forms the basis for the 
favorite type theorem (Theorem 2) and for the proof of Theorem 1, which is given in Section 5. Finally, 
in Section 6 we give tighter bounds on the index entropy and the entropy-coding gain for sources with 
memory. 

2 The Performance of Mismatched Codebooks 

In this section we characterize the compression performance achieved by memoryless random code- 
books when used to compress data generated from a stationary ergodic source. Two coding scenarios 
are considered: The "naive coding" scenario where data is simply described by the index of the first 
match in the codebook, and the "entropy-coded" case where this index is entropy-coded. 

In Section 3 we compare the performance of these two schemes, and explicitly evaluate the entropy-coding 
gain in two important special cases. 
2.1 Notation and Definitions 



We begin by introducing some basic definitions and notation that will remain in effect for the rest of 
the paper. 

Consider a stationary ergodic process (or source) $X = \{X_n ;\ n \ge 1\}$ taking values in the source 
alphabet A. We will assume throughout that A is a complete, separable metric space (often called a 
Polish space), equipped with its associated Borel $\sigma$-field $\mathcal{A}$. For the sake of simplicity we also make the 
(rather harmless) assumption that all singletons are measurable, i.e., $\{x\} \in \mathcal{A}$ for all $x \in A$. Similarly, 
for the reproduction alphabet $\hat{A}$ we take $(\hat{A}, \hat{\mathcal{A}})$ to be the Borel measurable space corresponding to a 
complete, separable metric (or Polish) space $\hat{A}$ and assume that $\{y\} \in \hat{\mathcal{A}}$ for all $y \in \hat{A}$. We write $X_i^j$ for 
the vector of random variables $X_i^j = (X_i, X_{i+1}, \ldots, X_j)$, and similarly $x_i^j = (x_i, x_{i+1}, \ldots, x_j) \in A^{j-i+1}$ 
for a realization of these random variables, $-\infty < i \le j < \infty$. We let $P_n$ denote the marginal 
distribution of $X_1^n$ on $A^n$ ($n \ge 1$), and write $\mathbb{P}$ for the distribution of the whole process. We use P for 
the first-order marginal $P_1$. 

Given an arbitrary nonnegative (measurable) function $\rho : A \times \hat{A} \to [0, \infty)$, define a sequence of 
single-letter (or "additive") distortion measures $\rho_n : A^n \times \hat{A}^n \to [0, \infty)$ by 

$$\rho_n(x_1^n, y_1^n) = \frac{1}{n} \sum_{i=1}^{n} \rho(x_i, y_i), \qquad x_1^n \in A^n,\ y_1^n \in \hat{A}^n.$$

For a distortion level $D \ge 0$ and a source string $x_1^n \in A^n$, we write $B(x_1^n, D)$ for the distortion-ball of 
radius D around $x_1^n$: 

$$B(x_1^n, D) = \{ y_1^n \in \hat{A}^n : \rho_n(x_1^n, y_1^n) \le D \}.$$

Throughout the paper, log denotes the logarithm to base 2 and In denotes the natural logarithm. 
Unless otherwise mentioned, all familiar information-theoretic quantities (entropy, mutual information, 
and so on) are defined in terms of logarithms taken to base 2, and are therefore expressed in bits. 

2.2 Random Codebooks 

Given a probability measure Q on the reproduction alphabet $\hat{A}$, a memoryless random codebook $C_n$ 
with distribution Q is an infinite sequence of i.i.d. random vectors $Y_1^n(i)$, $i \ge 1$, with each $Y_1^n(i)$ being 
distributed according to the product measure $Q^n$ on $\hat{A}^n$. In other words, the components of $Y_1^n(i)$ are 
i.i.d. with distribution Q. We write 

$$C_n = \{ Y_1^n(i) ;\ i \ge 1 \}$$

for the entire codebook, and we call Q the codebook distribution. 

Suppose that, for a fixed n, this codebook is available to both the encoder and decoder. Given a 
distortion level D and a source string $X_1^n$ to be described with distortion D or less, the encoder looks 
for a D-close match of $X_1^n$ in the codebook $C_n$. Let $N_n$ be the position of the first such match, 

$$N_n = \inf\{ i \ge 1 : \rho_n(X_1^n, Y_1^n(i)) \le D \},$$

with the convention that the infimum of the empty set equals $+\infty$. Roughly speaking, the way the 
encoder describes $X_1^n$ is by describing the position $N_n$ of this first match. 
Given a codebook distribution Q on $\hat{A}$, we define 

$$D_{\min} = E_P\big[ \operatorname{ess\,inf}_{Y \sim Q}\, \rho(X, Y) \big]$$

$$D_{\mathrm{av}} = E_{P \times Q}[\rho(X, Y)],$$

where $P = P_1$ denotes the first-order marginal of X.¹ We will assume throughout that $D_{\mathrm{av}}$ is finite. 
Clearly $0 \le D_{\min} \le D_{\mathrm{av}}$. To avoid the trivial case when $\rho(x, y)$ is constant for (P-almost) all $x \in A$, 
we assume that with positive P-probability $\rho(x, y)$ is not essentially constant in y, that is: 

$$D_{\min} < D_{\mathrm{av}}.$$

Note also that for D greater than $D_{\mathrm{av}}$ the rate-distortion function R(D) of X is zero, and that for 
D below $D_{\min}$ no match can ever be found. Therefore, from now on we restrict our attention to the 
interesting range of distortion levels $D \in (D_{\min}, D_{\mathrm{av}})$. 

We consider two possible ways in which the encoder can transmit N n : The simplest thing to do 
is to describe $N_n$ directly, using some predetermined code for the positive integers; see, e.g., [H3- This 
can be done with approximately $\log(N_n)$ bits. Alternatively, once the codebook $C_n$ has been fixed, 
the encoder may choose to "entropy-code" $N_n$, giving it an average description length of roughly 
$H(N_n|C_n)$ bits. This is equivalent to re-ordering the codewords according to decreasing order of 
probabilities, and then describing the new index $\pi(N_n)$ using approximately $\log(\pi(N_n))$ bits as above. 
When the statistics of the source are a-priori unknown, we use a universal algorithm to entropy-code 
(or re-order) the codewords. 

2.3 Naive Coding 

First we consider the case when the encoder describes the index N n without entropy-coding; we refer 
to this scenario as "naive coding." As mentioned in the Introduction, this coding scheme (and many 
variations on it) has been analyzed extensively in [28], [29], [8] and several other works cited therein. 
To avoid potentially infinite searches in the codebook, we make the simplifying assumption that the 
encoder only describes $N_n$ when it is smaller than $2^{nb}$, where b is some positive constant to be chosen 
later. Accordingly, we define the truncated index $N'_n$: 

$$N'_n \triangleq \begin{cases} N_n, & \text{if } N_n \le \lfloor 2^{nb} \rfloor, \\ \lfloor 2^{nb} \rfloor + 1, & \text{otherwise.} \end{cases}$$

When $N_n$ exceeds $\lfloor 2^{nb} \rfloor$, the encoder uses an alternative description for $X_1^n$. In order to ensure that 
such a description can be given with finite rate, we introduce the following simple conditions; cf. 

mnniHi. 

¹ Recall that the essential infimum of a function g(Y) of the random variable Y with distribution Q is defined as 
$\operatorname{ess\,inf}_{Y \sim Q}\, g(Y) = \sup\{ t \in \mathbb{R} : Q\{g(Y) > t\} = 1 \}$. 
(WQC): For a distortion level $D \ge 0$ we say that the weak quantization condition (WQC) holds 
at D if there is a (measurable) scalar quantizer $q : A \to B \subset \hat{A}$ such that B is a finite or countably 
infinite set, and 

$$\rho(x, q(x)) \le D \quad \text{for all } x \in A.$$

(pSQC): For a distortion level $D \ge 0$ we say that the p-strong quantization condition (pSQC) holds 
at D for some $p \ge 1$, if (WQC) holds with respect to a scalar quantizer q also satisfying 

$$M_p \triangleq \big\{ E_P\big[ (-\log \mu(q(X)))^p \big] \big\}^{1/p} < \infty,$$

where $\mu$ denotes the (discrete) distribution of the quantized random variable q(X). 

Note that for all $p' \ge p \ge 1$ we clearly have (p'SQC) $\Rightarrow$ (pSQC) $\Rightarrow$ (WQC), and that if the 
quantizer q of (WQC) has finite range then (pSQC) automatically holds for all $p \ge 1$. In particular, 
(1SQC) amounts simply to the requirement that there exists an appropriate scalar quantizer q with 
$H(q(X_1)) < \infty$. 

The encoder describes $X_1^n$ with distortion D or less in two steps. First, a description of $N'_n$ is given 
using Elias' code for the integers ^U]. This takes 

$$\log N'_n + 2 \log\log N'_n + O(1) \quad \text{bits.} \qquad (1)$$
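
A prefix-free integer code with the length profile in (1) can be sketched as follows. The text only says "Elias' code for the integers"; the sketch below assumes the Elias delta variant, whose codeword length for an integer N is $\lfloor \log_2 N \rfloor + 2\lfloor \log_2(\lfloor \log_2 N \rfloor + 1) \rfloor + 1$ bits. The function name `elias_delta` is a hypothetical choice.

```python
import math

def elias_delta(n: int) -> str:
    """Return the Elias delta codeword of a positive integer as a bit string.

    Assumption: the delta variant is used only because its length matches
    the log N + 2 log log N + O(1) profile; the text does not name a variant.
    """
    if n < 1:
        raise ValueError("Elias delta coding is defined for integers n >= 1")
    body = bin(n)[2:]            # binary expansion of n (leading bit is 1)
    length = bin(len(body))[2:]  # binary expansion of the bit-length of n
    # Unary prefix announcing the size of `length`, then `length`,
    # then n without its (implicit) leading 1.
    return "0" * (len(length) - 1) + length + body[1:]

for n in (1, 2, 10, 1000):
    code = elias_delta(n)
    bound = math.log2(n) + 2 * math.log2(math.log2(n) + 1) + 1
    print(f"n = {n:<5d} codeword = {code:<18s} length = {len(code):2d}  bound = {bound:.2f}")
```

The printed lengths stay below $\log_2 n + 2\log_2(\log_2 n + 1) + 1$, which is the concrete form of the $O(1)$ slack for this particular variant.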

If $N'_n \le \lfloor 2^{nb} \rfloor$, then $N_n = N'_n$ and the above description is sufficient for the decoder to recover a 
D-close version of $X_1^n$ from the codebook, so the second step is omitted. And if $N'_n = \lfloor 2^{nb} \rfloor + 1$, then 
the encoder also gives a representation of $X_1^n$ with distortion D or less using the quantizer q provided 
by (WQC). This can be given in 

$$\sum_{i=1}^{n} \lceil -\log \mu(q(X_i)) \rceil \quad \text{bits.}$$

Let $\ell_n(X_1^n)$ denote the overall description length of the algorithm just described. As we will see, the 
constant b can be chosen in such a way that $N_n$ will eventually be small enough so that the encoder 
will never need to resort to the alternative coding method. Therefore, in view of (1), to understand 
this code's compression performance (i.e., to understand the asymptotic behavior of $\ell_n(X_1^n)$) it suffices 
to understand the behavior of $(\log N_n)$ for large n. 

Suppose that a source string $X_1^n$ is given; the probability that any particular codeword $Y_1^n(i)$ 
matches $X_1^n$ with distortion D or less is $Q^n(B(X_1^n, D))$. If this probability is nonzero, then, conditional 
on $X_1^n$, the distribution of $N_n$ is geometric with parameter $Q^n(B(X_1^n, D))$. From this observation it is 
easy to deduce that $N_n$ is close (to first order in the exponent) to its mean, namely $1/Q^n(B(X_1^n, D))$, when n is large. The following 
result is an easy consequence of this fact and of Theorem B below. 
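
The geometric law of the first-match index can be checked with a small simulation. The sketch below is only an illustration under assumed toy parameters that do not come from the text: a binary alphabet, Hamming distortion, codebook distribution Q = Bernoulli(1/2), block length n = 8 and D = 0.25. In this setting the ball probability is an exact binomial sum, and the empirical mean of the first-match index should be close to its reciprocal. All function names are hypothetical.

```python
import math
import random

def ball_probability(n: int, max_mismatches: int) -> float:
    """Q^n(B(x, D)) for Q = Bernoulli(1/2) and Hamming distortion.

    With a uniform binary codebook the number of mismatches is Binomial(n, 1/2)
    for every source string x, so the value does not depend on x.
    """
    return sum(math.comb(n, k) for k in range(max_mismatches + 1)) / 2 ** n

def first_match_index(x, max_mismatches: int, rng: random.Random) -> int:
    """Draw i.i.d. Bernoulli(1/2) codewords until one falls in the ball around x."""
    i = 0
    while True:
        i += 1
        y = [rng.getrandbits(1) for _ in x]
        if sum(a != b for a, b in zip(x, y)) <= max_mismatches:
            return i

rng = random.Random(0)                # fixed seed so the run is reproducible
n, D = 8, 0.25                        # toy block length and distortion level
max_mismatches = int(n * D)           # average distortion <= D means <= n*D mismatches
x = [rng.getrandbits(1) for _ in range(n)]

p = ball_probability(n, max_mismatches)
trials = 20000
mean_index = sum(first_match_index(x, max_mismatches, rng) for _ in range(trials)) / trials
print(f"ball probability = {p:.5f}, 1/p = {1 / p:.3f}, empirical mean of N_n = {mean_index:.3f}")
```

At such a small block length the index itself still fluctuates widely; the concentration claimed in the text concerns $\frac{1}{n}\log N_n$ as n grows.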

Theorem A. Naive Coding Performance [8]: Suppose that X is a stationary ergodic source with 
first-order marginal $P_1 = P$, and that Q is an arbitrary codebook distribution on $\hat{A}$ with $D_{\mathrm{av}} < \infty$. If 
$D \in (D_{\min}, D_{\mathrm{av}})$ and X satisfies (WQC) at D, then for almost every sequence of memoryless random 
codebooks $C_n$ with distribution Q: 

$$\lim_{n\to\infty} \frac{1}{n} \ell_n(X_1^n) = \lim_{n\to\infty} \frac{1}{n} \log N_n = R(P,Q,D) \quad \text{bits/symbol, w.p.1.}$$
The rate-function $R(P,Q,D)$ is defined as 

$$R(P,Q,D) = \inf_{W} H(W \| P \times Q), \qquad (2)$$

where $H(W\|V)$ denotes the relative entropy between two probability measures W and V, 

$$H(W\|V) \triangleq \begin{cases} E_W\big[\log \frac{dW}{dV}\big], & \text{if the density } \frac{dW}{dV} \text{ exists,} \\ \infty, & \text{otherwise,} \end{cases}$$

and the infimum in (2) is taken over all joint distributions W on $A \times \hat{A}$ such that the first marginal of 
W is P and $E_W[\rho(X, Y)] \le D$. This result holds as long as the constant b is chosen $b > R(P,Q,D)$. 

See Section 3 for specific examples where the asymptotic rate $R(P,Q,D)$ can be explicitly evaluated. 

Theorem B [8]: Let X be a stationary ergodic source with first-order marginal distribution P, 
and let Q be an arbitrary codebook distribution on $\hat{A}$ with $D_{\mathrm{av}} < \infty$. Then for all $D \in (D_{\min}, D_{\mathrm{av}})$: 

$$\lim_{n\to\infty} -\frac{1}{n} \log Q^n(B(X_1^n, D)) = R(P,Q,D) \quad \text{w.p.1.}$$
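
The convergence in Theorem B can be observed numerically in a toy case where the ball probability is available exactly. The sketch assumes a binary alphabet, Hamming distortion and Q = Bernoulli(1/2); none of these choices come from the text. Under these assumptions the number of mismatches is Binomial(n, 1/2) for every source string, so $Q^n(B(x_1^n, D))$ is a binomial tail, and its exponent is known to converge to $1 - h(D)$ bits for $D < 1/2$, where h is the binary entropy function. Function names are hypothetical.

```python
import math

def binary_entropy(d: float) -> float:
    """Binary entropy h(d) in bits, for 0 < d < 1."""
    return -d * math.log2(d) - (1.0 - d) * math.log2(1.0 - d)

def match_exponent(n: int, D: float) -> float:
    """-(1/n) log2 Q^n(B(x, D)) for Q = Bernoulli(1/2) and Hamming distortion."""
    k_max = math.floor(n * D)
    # Exact integer count of codewords within the ball; Python integers do not overflow.
    count = sum(math.comb(n, k) for k in range(k_max + 1))
    return (n - math.log2(count)) / n

D = 0.25
limit = 1.0 - binary_entropy(D)
exponents = {n: match_exponent(n, D) for n in (20, 200, 2000)}
for n, value in exponents.items():
    print(f"n = {n:<5d} exponent = {value:.4f} bits/symbol")
print(f"limit 1 - h(D) = {limit:.4f} bits/symbol")
```

In this toy case the limiting exponent does not depend on the source distribution at all, which illustrates the earlier remark that the naive coding rate can be entirely independent of P.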

2.4 Entropy-Coding the Index 

Next we consider the case of entropy-coding, where, after the codebook $C_n$ has been fixed, the encoder 
uses the conditional distribution (given $C_n$) of the position of the first D-close match to optimally 
describe this position to the decoder: The truncated index $N'_n$ is first described using $H(N'_n|C_n) + O(1)$ 
bits, on the average. As before, if $N_n \le \lfloor 2^{nb} \rfloor$ this offers a complete D-close representation of $X_1^n$. 
Otherwise, the encoder adds to this an alternative representation of $X_1^n$ using the quantizer q provided 
by (WQC). On the average, this takes 

$$\sum_{i=1}^{n} E_P\big( \lceil -\log \mu(q(X_i)) \rceil \big) \quad \text{bits.}$$

Let $\mathcal{L}_n(X_1^n)$ denote the overall description length of the above coding scheme. Next we give an 
upper bound on the asymptotic rate it achieves. Given an arbitrary codebook distribution Q on $\hat{A}$, 
define the output-constrained rate-distortion function (or lower mutual information (LMI)) j^EH] by 

$$I_m(P\|Q, D) = \inf_{X \sim P,\ Y \sim Q,\ E\rho(X,Y) \le D} I(X; Y), \qquad (3)$$

where $I(X; Y)$ denotes the mutual information between X and Y, and the infimum is taken over all 
jointly distributed random variables (X, Y) such that $X \sim P$, $Y \sim Q$, and $E\rho(X, Y) \le D$. Using 
the chain rule for relative entropy it is easy to verify that the earlier rate-function $R(P,Q,D)$ can be 
expressed as 

$$R(P,Q,D) = \inf_{Q'} \big[ I_m(P\|Q', D) + H(Q'\|Q) \big], \qquad (4)$$
where the infimum is over all probability measures $Q'$ on $\hat{A}$. As we show in Section 4.1, the minimizer 
of (4) exists and is unique, and we denote it by $Q^*_{P,Q,D}$: 

$$Q^*_{P,Q,D} = \arg\min_{Q'} \big[ I_m(P\|Q', D) + H(Q'\|Q) \big].$$

Theorem 1. Entropy-Coding Performance: Suppose that X is a stationary ergodic source with 
first-order marginal distribution P, and that Q is an arbitrary (i.i.d.) codebook distribution with 
$D_{\mathrm{av}} < \infty$. Assume that $D \in (D_{\min}, D_{\mathrm{av}})$ and that X satisfies (pSQC) at D for some $p > 1$. Then the 
rate of the entropy-coded scheme with memoryless codebooks $C_n$ with distribution Q satisfies 

$$\limsup_{n\to\infty} \frac{1}{n} E[\mathcal{L}_n(X_1^n)] = \limsup_{n\to\infty} \frac{1}{n} H(N'_n \mid C_n) \le I_m(P\|Q^*_{P,Q,D}, D) \quad \text{bits/symbol,} \qquad (5)$$

where the expectation is taken over both the message $X_1^n$ and the random codebook $C_n$. This result 
holds as long as $b > R(P,Q,D)$. 

We immediately obtain from this and Theorem A above: 

Corollary 1. Entropy-Coding Gain: Under the assumptions of Theorem 1, the rate gain of entropy-coding 
over the naive coding scheme is at least 

$$R(P,Q,D) - I_m(P\|Q^*_{P,Q,D}, D) = H(Q^*_{P,Q,D}\|Q) \quad \text{bits/symbol.} \qquad (6)$$

As shown in entropy coding without conditioning on the codebook yields a rate equal to the 
naive coding rate. Thus, the quantity on the right-hand side of (6) can also be thought of as the 
rate-gain due to matching the entropy coder to the specific realization of the codebook. 

In Section 6 we generalize and refine the bounds (5) and (6). The discussion there suggests that, 
as in the case of discrete memoryless sources [H^, these bounds are tight also for general memoryless 
sources; for sources with memory the inequality (5) is strict and hence the entropy-coding gain (6) is 
larger. 

The measure $Q^*_{P,Q,D}$ has an interesting coding interpretation that will be clarified further in Theorem 2: 
When n is large, the empirical distribution of the first matching codeword $Y_1^n(N_n)$ in the 
codebook is close to $Q^*_{P,Q,D}$ with high probability. In the case of discrete memoryless sources this 
phenomenon can be explained using the method of types. The lower mutual information 
$I_m(P\|Q', D)$ represents the rate achieved by a fixed-composition codebook, namely a codebook 
consisting exclusively of codewords with type $Q'$. Equivalently, $I_m(P\|Q', D)$ is the exponent in the 
probability that a source string will match a type-$Q'$ string with distortion D or less. In this light, a 
memoryless random codebook with distribution Q can be thought of as a union of polynomially many 
fixed-composition codebooks, where the proportion of words of type $Q'$ is approximately $\exp[-nH(Q'\|Q)]$. Now, 
codewords of type $Q' = Q$ are very frequent in the codebook but their lower mutual information is 
very high (i.e., low matching probability), whereas codewords of type $Q' = Q^*_{P,D}$, where $Q^*_{P,D}$ achieves 
the rate-distortion function (11), have the lowest lower mutual information (i.e., high matching probability), 
but they are too rare in the codebook. Therefore, we can think of the achieving measure 
$Q' = Q^*_{P,Q,D}$ in (4) as corresponding to the codeword type that strikes the optimum balance between 
the competing requirements of high matching probability and high frequency in the codebook. 
Note that it follows from (5) and (6) that 

$$R(P,Q,D) \ge I_m(P\|Q^*_{P,Q,D}, D) \ge R(D)$$

with equality in both inequalities if and only if Q achieves the rate-distortion function in (11). 

Proof outline of Theorem 1. The equality in (5) follows simply from the observation that $E[\mathcal{L}_n(X_1^n)] \ge H(N'_n|C_n)$ 
combined with 

$$\begin{aligned}
E[\mathcal{L}_n(X_1^n)] &\le H(N'_n|C_n) + E\Big\{ \sum_{i=1}^{n} \big[ -\log \mu(q(X_i)) + 1 \big]\, I_{\{N'_n = \lfloor 2^{nb} \rfloor + 1\}} \Big\} + O(1) \\
&\overset{(a)}{\le} H(N'_n|C_n) + n (M_p + 1) \Pr\{ N_n \ge \lfloor 2^{nb} \rfloor + 1 \} + O(1) \\
&\overset{(b)}{\le} H(N'_n|C_n) + o(n), \qquad (7)
\end{aligned}$$

where $I_E$ denotes the indicator function of an arbitrary event E, (a) follows by Hölder's inequality, 
and (b) follows from Theorem B. 

The existence and uniqueness of $Q^*_{P,Q,D}$ is established in Section 4. The inequality in (5) is the 
main technical content of the theorem; its proof is given in Section 5. □ 

3 The Entropy-Coding Gain 

In this section we illustrate the gain of entropy-coding over the naive coding scheme in two particular 
instances where it can be explicitly evaluated. As mentioned in the introduction, we first consider the 
case of Gaussian codebooks with large variance. Since such a codebook distribution is approximately 
uniform over the whole real line, it is tempting to think of the entropy-coded scheme as a "randomized" 
version of an entropy-coded, uniform lattice quantizer. Indeed, we show that, as the codebook variance 
grows to infinity, the rate achieved by the entropy-coded scheme is at least as good as the asymptotic 
rate of entropy-coded dithered quantization (ECDQ). 

Then in Section 3.2 we consider a more general class of approximately uniform, or "asymptotically 
flat," codebook distributions, corresponding to appropriately defined exponential families. In this case 
we argue that the resulting compression performance can be determined in a way analogous to the 
analysis given for the Gaussian case. 

3.1 Universal Gaussian Codebooks 

Let X be a stationary and ergodic, real-valued source to be compressed, and suppose X has zero 
mean $E(X_1) = 0$ and finite variance $\sigma^2 = \mathrm{Var}(X_1) < \infty$. We consider memoryless random codebooks 
generated according to the Gaussian distribution $Q \sim N(0, \tau^2)$, and we take $\rho$ to be the squared-error 
distortion measure $\rho(x, y) = (x - y)^2$. Under these assumptions, the rate achieved by the naive coding 
scheme is [8], 



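$$R(P,Q,D) = \begin{cases} \infty, & D = 0, \\ \frac{1}{2}\log\big(\frac{v}{D}\big) - (\log e)\,\frac{(v - D)(v - \sigma^2)}{2 v \tau^2}, & 0 < D < \sigma^2 + \tau^2, \\ 0, & D \ge \sigma^2 + \tau^2, \end{cases} \qquad (8)$$

where 

$$v \triangleq \frac{1}{2}\Big[ \tau^2 + \sqrt{\tau^4 + 4 D \sigma^2} \Big].$$

[Note that here $D_{\min} = 0$, $D_{\mathrm{av}} = \sigma^2 + \tau^2$.] We observe that in this case $R(P,Q,D)$ depends only on 
the first and second moments of the source distribution, and that asymptotically for large codebook 
variance it takes the form 

$$R(P,Q,D) = \frac{1}{2}\log\Big(\frac{\tau^2}{eD}\Big) + o(1) \qquad (9)$$

where $o(1) \to 0$ as $\tau^2 \to \infty$. 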
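
The following sketch evaluates the naive coding rate for a Gaussian codebook numerically. It assumes that the closed form in (8) reads $R = \frac{1}{2}\log_2(v/D) - (\log_2 e)\,(v - D)(v - \sigma^2)/(2v\tau^2)$ with $v = \frac{1}{2}[\tau^2 + \sqrt{\tau^4 + 4D\sigma^2}]$; this form is restated in the code so the example is self-contained. It then checks two consequences mentioned in the text: the matched choice $\tau^2 = \sigma^2 - D$ gives $\frac{1}{2}\log_2(\sigma^2/D)$, and for large $\tau^2$ the rate approaches $\frac{1}{2}\log_2(\tau^2/(eD))$. The function name and the parameter values are hypothetical.

```python
import math

def naive_rate_gaussian(sigma2: float, tau2: float, D: float) -> float:
    """Naive coding rate R(P,Q,D) in bits for Q = N(0, tau2) and squared error.

    Assumed closed form (stated in the lead-in); it depends on the source
    only through its variance sigma2 (the mean is taken to be zero).
    """
    if D <= 0.0:
        return math.inf
    if D >= sigma2 + tau2:
        return 0.0
    v = 0.5 * (tau2 + math.sqrt(tau2 ** 2 + 4.0 * D * sigma2))
    return 0.5 * math.log2(v / D) - math.log2(math.e) * (v - D) * (v - sigma2) / (2.0 * v * tau2)

sigma2, D = 1.0, 0.1

# Matched codebook variance: the rate reduces to the Gaussian rate-distortion function.
matched = naive_rate_gaussian(sigma2, sigma2 - D, D)
print(f"tau^2 = sigma^2 - D : {matched:.6f} bits  (0.5*log2(sigma^2/D) = {0.5 * math.log2(sigma2 / D):.6f})")

# Large codebook variance: compare with the asymptotic form 0.5*log2(tau^2 / (e*D)).
errors = {}
for tau2 in (1e2, 1e4, 1e6):
    exact = naive_rate_gaussian(sigma2, tau2, D)
    approx = 0.5 * math.log2(tau2 / (math.e * D))
    errors[tau2] = exact - approx
    print(f"tau^2 = {tau2:<9g} exact = {exact:.6f}  asymptotic = {approx:.6f}")
```

The difference between the exact and asymptotic values shrinks as the codebook variance grows, while the rate itself grows without bound, which is the behavior that motivates entropy-coding the index.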

Remark: In more familiar information-theoretic terms, the rate-function R(P, Q, D) can equiva- 
lently be expressed as 

$$R(P,Q,D) = \inf \big[ I(X;Y) + H(Q_Y\|Q) \big] \qquad (10)$$

where the infimum is over all jointly distributed random variables (X, Y) with values in $A \times \hat{A}$, such 
that X has distribution P, $E[\rho(X,Y)] \le D$, and $Q_Y$ denotes the distribution of Y; cf. [29]. 

This expression shows that, typically, the rate achieved by the naive coding scheme is strictly 
suboptimal, unless of course the source itself is memoryless and Q is chosen to minimize R(P,Q,D). 
In fact from (|ll)j) it is immediate that for a memoryless source with rate-distortion function R{D) we 
indeed have 

R(D) = inf R(P, Q, D), (11) 
Q 

where the infimum is over all probability distributions Q on A. 

Example. Known Variance. Now suppose that the source $X$ is believed to be i.i.d. Gaussian with
$N(0,\sigma^2)$ distribution. As is well known [3], for any $D \in (0,\sigma^2)$ the optimal coding distribution
is $Q^* \sim N(0,\sigma^2 - D)$, therefore we construct memoryless random codebooks according to $Q^*$. But
instead of the Gaussian source we expected, we are faced with data from some arbitrary stationary
ergodic $X$ with zero mean and variance $\sigma^2$. From the previous example it follows that the asymptotic
rate achieved by the naive coding scheme will be [substituting $\tau^2 = \sigma^2 - D$ in (8)]

$$\frac{1}{2}\log\left(\frac{\sigma^2}{D}\right) \quad \text{bits/symbol}.$$

This is exactly the rate-distortion function of the i.i.d. $N(0, \sigma^2)$ source, so the rate achieved is the same
as what we would have obtained on the Gaussian source we originally expected. This coincides with
Sakrison's robust fixed-rate result for a class of sources [26]. It is yet another version of the folk theorem
that the Gaussian source is the hardest one to compress, among all real-valued sources with a fixed 
variance; cf. pUj . 

Turning back to the general case, suppose $X$ is a zero-mean, stationary and ergodic, real-valued
source, with variance $\sigma^2 = \mathrm{Var}(X_1) < \infty$, and let the codebook distribution be $Q \sim N(0,\tau^2)$. Choose
and fix a distortion level $D > 0$. From (9) we have that, for $\tau^2$ large, the rate achieved by the naive
coding scheme is

$$\log(\tau) + O(1) \quad \text{bits/symbol}$$

which of course grows to infinity as $\tau^2 \to \infty$. On the other hand, as we show in the following
proposition, the rate achieved by the entropy-coding scheme stays bounded, and for memoryless sources it
coincides with the asymptotic (large vector dimension) rate of entropy-coded dithered lattice
quantization (ECDQ). This confirms the natural intuition that the behavior of a random codebook with an
approximately flat distribution should mimic the behavior of an entropy-coded uniform quantizer.

Proposition 1. Entropy-Coding Gain for Universal Gaussian Codebooks: Let $X$ be a real-valued,
zero-mean, stationary ergodic source, with finite variance $\sigma^2 = \mathrm{Var}(X_1)$, let $D > 0$ be a fixed distortion
level, and take $Q \sim N(0,\tau^2)$. Let $Q^*_{P,Q,D}$ be as in Theorem 1. We have:

(i) The measure $Q^*_{P,Q,D}$ converges to $P * N(0, D)$ as $\tau \to \infty$, in that $Q^*_{P,Q,D}$ has a density $f_{Q^*_{P,Q,D}}(y)$
(with respect to Lebesgue measure) for all $\tau$ large enough and

$$f_{Q^*_{P,Q,D}}(y) \to E_P[\phi_D(y - X)] \quad \text{as } \tau^2 \to \infty, \text{ for all } y \in \mathbb{R},$$

where $\phi_D$ denotes the density of the $N(0,D)$ distribution.

(ii) The upper bound $I_m(P\|Q^*_{P,Q,D}, D)$ to the rate achieved by the entropy-coding scheme satisfies

$$\lim_{\tau \to \infty} I_m(P\|Q^*_{P,Q,D}, D) = I(X; X + Z_D) \quad \text{bits/symbol},$$

where $X \sim P$, and $Z_D$ denotes a $N(0, D)$ random variable independent of $X$.

(iii) If the source $X$ is memoryless then as $\tau \to \infty$ the rate achieved by the entropy-coding scheme
is no greater than

$$R(D) + \frac{1}{2} \quad \text{bits/symbol},$$

where $R(D)$ is the rate-distortion function of $X$.

A proof outline for Proposition 1 is given in Appendix A. As we will discuss in Section 6, for
sources with memory the entropy-coding gain is generally significantly larger. In fact when $X$ is not
memoryless:

1. the rate achieved by the entropy-coding scheme is actually equal to the mutual information rate
$I(X; X + Z_D)$, where $Z_D$ is a white Gaussian process with variance $D$;

2. the result in part (iii) of the proposition is valid for all stationary and ergodic sources.
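The half-bit statement in part (iii) can be checked by hand in the one case where both quantities have elementary closed forms. The sketch below assumes, purely for illustration, that the source itself is i.i.d. $N(0,\sigma^2)$, so that $R(D) = \max\{0, \frac{1}{2}\log(\sigma^2/D)\}$ and $I(X; X+Z_D) = \frac{1}{2}\log(1 + \sigma^2/D)$. The function names are hypothetical.

```python
import math

def gaussian_rd(sigma2: float, D: float) -> float:
    """Rate-distortion function (bits) of an i.i.d. N(0, sigma2) source, squared error."""
    return max(0.0, 0.5 * math.log2(sigma2 / D))

def awgn_mutual_information(sigma2: float, D: float) -> float:
    """I(X; X + Z_D) in bits for independent X ~ N(0, sigma2) and Z_D ~ N(0, D)."""
    return 0.5 * math.log2(1.0 + sigma2 / D)

def redundancy(sigma2: float, D: float) -> float:
    """Gap between the limiting bound of part (ii) and R(D), for a Gaussian source."""
    return awgn_mutual_information(sigma2, D) - gaussian_rd(sigma2, D)
```

Under this assumption the gap equals $\frac{1}{2}\log(1 + D/\sigma^2)$ for $D < \sigma^2$, which stays below half a bit and reaches exactly $1/2$ at $D = \sigma^2$.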

3.2 Approximately Flat Codebooks and Difference Distortion Measures 

We now extend the asymptotic result above to more general codebook distributions and to arbitrary 
difference distortion measures (not necessarily squared error). 

Suppose that $A = \hat{A} = \mathbb{R}$ and that $\rho$ is a difference distortion measure of the form $\rho(x, y) = \rho(y - x)$
for some $\rho : \mathbb{R} \to [0, \infty)$. Here we consider real-valued, stationary ergodic sources $X$, and codebook
distributions $Q$ that have a density $f_Q$ (with respect to Lebesgue measure).

We begin by deriving a lower bound for the rate-function $R(P, Q, D)$, in the spirit of the Shannon
Lower Bound [21].

Lemma SLB. A "Shannon Lower Bound" for Difference Distortion Measures: Assume that the
codebook distribution $Q$ has a density $f_Q$ (with respect to Lebesgue measure), and let $Q_{\max} =
\sup_y f_Q(y)$. Then,

$$R(P, Q, D) \geq \log(1/Q_{\max}) - h_{\max}(D), \tag{12}$$

where $h_{\max}(D)$ is the maximum entropy associated with $\rho$ and $D$,

$$h_{\max}(D) = \max_{f:\, E_f[\rho(Z)] \leq D} h(f)$$

and where $h(f) = -\int f(x) \log f(x)\,dx = h(Z)$ denotes the differential entropy of a random variable $Z$
with density $f$.

Note that the lower bound (12) is independent of the source distribution $P$.
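To make the lower bound concrete, here is a small sketch of $h_{\max}(D)$ for two standard difference distortion measures, together with the value of the bound for a uniform codebook on $[-K, K]$ (where $Q_{\max} = 1/(2K)$). It relies on two standard facts not derived in the text: under $E[Z^2] \leq D$ the maximum entropy density is $N(0,D)$, and under $E|Z| \leq D$ it is Laplace with mean absolute value $D$. Function names are illustrative.

```python
import math

def h_max_squared_error(D: float) -> float:
    """Max differential entropy (bits) subject to E[Z^2] <= D: Gaussian N(0, D)."""
    return 0.5 * math.log2(2.0 * math.pi * math.e * D)

def h_max_absolute_error(D: float) -> float:
    """Max differential entropy (bits) subject to E|Z| <= D: Laplace with scale D."""
    return math.log2(2.0 * math.e * D)

def slb_uniform_codebook(K: float, h_max: float) -> float:
    """Lower bound log(1/Q_max) - h_max(D) for Q uniform on [-K, K], in bits."""
    return math.log2(2.0 * K) - h_max
```

For instance, `slb_uniform_codebook(8.0, h_max_squared_error(D))` gives the bound for squared error with a codebook uniform on $[-8, 8]$; the source distribution does not enter.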

Proof. Consider the infimum in (10). For any jointly distributed $(X, Y)$ such that $E[\rho(Y - X)] \leq D$,
let $Q_Y$ and $f_Y$ denote the measure and the density describing the distribution of $Y$, respectively. We
can then write [3],

$$I(X;Y) = h(Y) - h(Y|X) = h(Y) - h(Y - X|X) \geq h(Y) - h(Y - X) \geq h(f_Y) - h_{\max}(D)$$

where the first inequality holds since conditioning reduces the entropy. On the other hand, we can
expand

$$H(Q_Y\|Q) = -h(f_Y) + E_{Q_Y}[-\log f_Q(Y)] \geq -h(f_Y) + \log(1/Q_{\max}).$$

Combining, we get

$$I(X;Y) + H(Q_Y\|Q) \geq \log(1/Q_{\max}) - h_{\max}(D). \tag{13}$$

Now note that we have implicitly assumed that $Y$ has a conditional density given $x$ for $P$-almost all $x$,
but if it did not then the relative entropy $H(W\|P \times Q) = E_P[H(W(\cdot|X)\|Q)] = I(X;Y) + H(Q_Y\|Q)$
between the joint distribution $W$ of $(X, Y)$ and $(P \times Q)$ would be infinite, so the above bound would
still hold. Therefore, in view of (10), the right-hand side of (13) is also a lower bound for $R(P, Q, D)$. □

Similarly to the Shannon lower bound for the rate-distortion function [21], the lower bound above
turns out to be tight for several interesting special cases. To see this we first derive an upper bound
for $R(P,Q,D)$. In (10) we can always pick

$$Y = X + Z_D, \tag{14}$$

where $Z_D \sim f_D$ is independent of $X$ and it achieves the maximum entropy associated with $\rho$ and $D$,
i.e., $h(Z_D) = h(f_D) = h_{\max}(D)$. For this choice $I(X;Y) = h(Y) - h(Z_D) = h(f_Y) - h_{\max}(D)$, where
$f_Y$ is the density of $Y = X + Z_D$. Therefore,

$$
\begin{aligned}
R(P,Q,D) &\leq h(f_Y) - h_{\max}(D) + H(Q_Y\|Q) && (15) \\
&= E_{f_Y}[-\log f_Q(Y)] - h_{\max}(D), && (16)
\end{aligned}
$$

where $Y = X + Z_D$.

Now if $f_Q(y)$ is continuous near $y_{\max} = \arg\max_y f_Q(y)$, and $f_Y$ is concentrated around $y_{\max}$, then
$E_{f_Y}[-\log f_Q(Y)] \approx \log(1/Q_{\max})$, and the two bounds (12) and (16) are close. Since the lower bound
(12) is independent of $P$, closeness of the bounds would imply that $R(P, Q, D)$ is only weakly dependent
on the source distribution. For example, for a uniform codebook distribution $Q \sim U[-K,K]$ we have
$Q_{\max} = 1/(2K)$ so the lower bound is $R(P,Q,D) \geq \log(2K) - h_{\max}(D)$. On the other hand, if $K$ is
large enough so that $f_Y(y) = 0$ for $|y| > K$, then $E_{f_Y}[-\log f_Q(Y)] = \log(2K)$, and the lower bound
is tight. See also Lemma 1] 

More generally, suppose that the codebook distribution $Q = Q_s$ has an exponential density of the
form

$$f_s(y) = B_s \exp(-s\,g(y)), \quad s > 0, \tag{17}$$

where $g$ is any suitable (nonnegative) function with $g(0) = 0$. Gaussian codebooks correspond to
the case $g(y) = y^2$, while uniform codebooks correspond to a "well-shaped" $g$. Moreover, for any
"nice" $g$ (as stated rigorously in the next lemma), as $s \to 0$ the exponential density $f_s(y)$ tends to be
locally uniform relative to the $Y$ of (14). This explains the following asymptotic characterization of
$R(P,Q,D)$.

Lemma TIGHT. Asymptotically Flat Codebooks: For a difference distortion measure and an
exponential codebook distribution $Q = Q_s$ of the form (17), if $E[g(X + Z_D)]$ is finite, then the lower bound
(12) becomes tight as $s \to 0$:

$$R(P,Q_s,D) = \log(1/B_s) - h_{\max}(D) + o(1).$$

Proof. For $Q = Q_s$ we have $Q_{\max} = B_s$ and $y_{\max} = 0$, so the lower bound (12) is equal to
$\log(1/B_s) - h_{\max}(D)$. On the other hand, $-\log f_s(y) = \log(1/B_s) + s\,g(y)$, so the upper bound (16)
is equal to $\log(1/B_s) + sE[g(X + Z_D)] - h_{\max}(D)$, which approaches the lower bound as $s \to 0$. □

An interesting consequence of Lemma TIGHT is that, as for uniform codebooks, for very flat
codebook distributions $Q_s$ the rate-function $R(P,Q_s,D)$ is almost independent of the source
distribution $P$. In particular, for a Gaussian $Q \sim N(0, \tau^2)$ and any source $P$,

$$R(P, Q, D) = \frac{1}{2}\log(2\pi\tau^2) - h_{\max}(D) + o(1)$$

as $\tau \to \infty$. If $\rho$ = squared error, then $h_{\max}(D) = (1/2)\log(2\pi e D)$, and we obtain $R(P,Q,D) =
(1/2)\log(\tau^2/eD) + o(1)$ as in (9).
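The sandwich between the lower bound (12) and the upper bound (16) can be tabulated directly in the Gaussian case. The sketch below works in nats and assumes a Gaussian codebook $Q \sim N(0,\tau^2)$ written in the exponential form with $g(y) = y^2$, $s = 1/(2\tau^2)$ and $B_s = (2\pi\tau^2)^{-1/2}$, a zero-mean source with variance $\sigma^2$, and squared-error distortion, so that $E[g(X + Z_D)] = \sigma^2 + D$. Function names are illustrative.

```python
import math

def slb_lower_bound_nats(tau2: float, D: float) -> float:
    """log(1/Q_max) - h_max(D) for Q = N(0, tau2) and squared error, in nats."""
    log_inv_qmax = 0.5 * math.log(2.0 * math.pi * tau2)
    h_max = 0.5 * math.log(2.0 * math.pi * math.e * D)
    return log_inv_qmax - h_max

def upper_bound_nats(sigma2: float, tau2: float, D: float) -> float:
    """Upper bound with Y = X + Z_D: the lower bound plus s * E[g(Y)]."""
    s = 1.0 / (2.0 * tau2)
    return slb_lower_bound_nats(tau2, D) + s * (sigma2 + D)
```

Under these assumptions the gap between the two bounds is $(\sigma^2 + D)/(2\tau^2)$ nats, which vanishes as $\tau^2 \to \infty$, and the lower bound equals $\frac{1}{2}\ln(\tau^2/(eD))$.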

Another consequence of the asymptotic tightness of the upper bound (16) is that an additive
maximum entropy noise channel of the form (14) asymptotically achieves the minimizations © and
(10). This observation extends the asymptotic additive Gaussian noise channel characterization of
$I_m(P\|Q^*_{P,Q,D}, D)$ in Proposition 1. We state this result in the following proposition and prove it in
Appendix D.

Proposition 2. Entropy-Coding Gain for Approximately Flat Codebooks: Let $\rho$ be a difference
distortion measure such that the maximum entropy $h_{\max}(D)$ defined in Lemma SLB exists and is strictly
monotonically increasing with $D$. Let $Q_s$ be any exponential codebook distribution of the form (17)
such that $E[g(X + Z_D)]$ is finite. Then:

(i) The measures $Q^*_{P,Q_s,D}$ converge to $P * f_D$ as $s \to 0$, where $P * f_D$ is the distribution of $Y = X + Z_D$,
and where $Z_D$ is the random variable achieving $h_{\max}(D)$. This convergence is in the sense that
the density of $Q^*_{P,Q_s,D}$ converges to the density of $Y = X + Z_D$.

(ii) The upper bound $I_m(P\|Q^*_{P,Q_s,D}, D)$ to the rate achieved by the entropy-coding scheme satisfies

$$\lim_{s \to 0} I_m(P\|Q^*_{P,Q_s,D}, D) = I(X; X + Z_D) \quad \text{bits/symbol},$$

where $X \sim P$, and $Z_D$ is the maximum entropy random variable achieving $h_{\max}(D)$.

(iii) If the source $X$ is memoryless, then as $s \to 0$ the rate achieved by the entropy-coding scheme is
no greater than

$$R(D) + C^* \quad \text{bits/symbol}, \tag{18}$$

where the universal constant $C^*$ is defined as

$$C^* = C^*(\rho, D) = \sup_{U:\, E\rho(U) \leq D} I(U; U + Z_D).$$

Note that $C^*$ is an upper bound for the "min-max capacity" defined in [35] in connection with the
Wyner-Ziv problem. In particular, for an $r$th-power distortion measure $\rho(x - \hat{x}) = |x - \hat{x}|^r$, we have
$0.5 \leq C^* \leq 1$ bit, and for $r = 2$ we have $C^* = 1/2$ bit in accordance with Proposition 1 (iii); see [32].

We note that the codebook distribution $Q_s$ in Proposition 2 is independent of the source, of
the distortion measure, and of the distortion level. This implies a strong robustness property for
a memoryless codebook drawn from $Q_s$: For sufficiently small $s$, such a codebook is universal (in
the sense of the bounded loss in (18)) for any source and any distortion criterion admissible by the
proposition. Thus, we may imagine that we first fix the codebook; then we select the desired distortion
criterion to generate the $D$-balls; and finally we let the source induce the codeword distribution which
determines the entropy code.



4 An Almost Sure Conditional Limit Theorem 
4.1 Preliminaries 

As before, we assume throughout this section that $X$ is a stationary ergodic source and $Q$ is an
arbitrary codebook distribution with $0 \leq D_{\min} < D_{\rm av} < \infty$. Also we fix a distortion level $D \in
(D_{\min}, D_{\rm av})$. Under these conditions, from |HJ Theorem 2] we know that $R(P, Q, D)$ is finite and strictly
positive, and that the infimum in its definition in © is always achieved by some joint distribution
$W^*$ with $E_{W^*}[\rho(X, Y)] = D$. Moreover, since the set of $W$ over which the infimum is taken is convex,
we know that this $W^*$ is the unique minimizer.
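Alternatively, $R(P, Q, D)$ can be expressed as

$$(\ln 2)\,R(P,Q,D) = \sup_{\lambda \leq 0}\,[\lambda D - \Lambda(\lambda)] = \lambda^* D - \Lambda(\lambda^*), \tag{19}$$

where

$$\Lambda(\lambda) \triangleq E_P\left[\ln E_Q\left(e^{\lambda\rho(X,Y)}\right)\right],$$

and $\lambda^*$ is the unique negative real number with

$$\Lambda'(\lambda^*) = D; \tag{20}$$

where prime denotes derivative, cf. jSJ Theorem 2].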
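As an illustration of the variational characterization, the supremum can be evaluated numerically in the Gaussian-codebook, squared-error setting of Section 3.1. The sketch below assumes a zero-mean source with variance $\sigma^2$ and $Q \sim N(0,\tau^2)$; in that case a direct Gaussian integral gives $\Lambda(\lambda) = -\frac{1}{2}\ln(1 - 2\lambda\tau^2) + \lambda\sigma^2/(1 - 2\lambda\tau^2)$ for $\lambda \leq 0$. This closed form and the bisection solver are additions for illustration, not taken from the paper.

```python
import math

def Lambda(lam: float, sigma2: float, tau2: float) -> float:
    """Lambda(lam) = E_P[ln E_Q exp(lam (X - Y)^2)] for Y ~ N(0, tau2), lam <= 0 (nats)."""
    c = 1.0 - 2.0 * lam * tau2
    return -0.5 * math.log(c) + lam * sigma2 / c

def Lambda_prime(lam: float, sigma2: float, tau2: float) -> float:
    """Derivative of Lambda; increases from 0 (lam -> -inf) to sigma2 + tau2 (lam = 0)."""
    c = 1.0 - 2.0 * lam * tau2
    return tau2 / c + sigma2 / c ** 2

def rate_via_legendre(sigma2: float, tau2: float, D: float) -> float:
    """Solve Lambda'(lam*) = D by bisection and return (lam* D - Lambda(lam*)) in bits.

    Assumes 0 < D < sigma2 + tau2.
    """
    lo, hi = -1.0, 0.0
    while Lambda_prime(lo, sigma2, tau2) > D:   # expand until the root is bracketed
        lo *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if Lambda_prime(mid, sigma2, tau2) > D:
            hi = mid
        else:
            lo = mid
    lam = 0.5 * (lo + hi)
    return (lam * D - Lambda(lam, sigma2, tau2)) / math.log(2.0)
```

With $\sigma^2 = 1$, $\tau^2 = 0.75$, $D = 0.25$ the solver finds $\lambda^* = -2$ and a rate of 1 bit, matching the Known Variance example.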

Let $Q' = W^*_Y$ denote the $Y$-marginal of $W^*$. From the above discussion and from equations (10),
(5) and (3) it follows that the infimum in the definition of $Q^*_{P,Q,D}$ is uniquely achieved by $Q'$. That is

$$Q' = W^*_Y = Q^*_{P,Q,D}$$

where as before $P = P_1$ denotes the first-order marginal of the source distribution. Now let $\hat{Q}_n$ denote
the empirical distribution induced by the matching codeword $Z_1^n = Y_1^n(N_n)$ on $\hat{A}$:

$$\hat{Q}_n = \hat{P}_{Y_1^n(N_n)} = \frac{1}{n}\sum_{i=1}^n \delta_{Z_i}.$$

4.2 Results 

Our first result says that, when $n$ is large, $\hat{Q}_n \approx Q'$ with high probability. This generalizes and
strengthens the "favorite-type theorem" of [34].

Theorem 2. Empirical Distribution of the Matching Codeword ("Favorite Type"):
(i) For every (measurable) $E \subset \hat{A}$, any $\delta > 0$, and $P$-almost every source realization $x_1^\infty$, as $n \to \infty$
we have:

$$\Pr\{|\hat{Q}_n(E) - Q'(E)| > \delta \mid X_1^n = x_1^n\} \to 0 \quad \text{exponentially fast}.$$

(ii) With probability one,

$$\hat{Q}_n \Rightarrow Q', \quad \text{as } n \to \infty,$$

where '$\Rightarrow$' denotes weak convergence of probability measures.
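The objects in Theorem 2 can be simulated in a toy setting. The sketch below is not the paper's construction in full generality: it assumes a binary alphabet, Hamming distortion, and an i.i.d. Bernoulli codebook, and simply draws codewords until the first one falls in the $D$-ball around the source string, then reports that codeword's empirical distribution. All names and parameters are illustrative.

```python
import random

def first_match(x, D, q1, rng, max_tries=1_000_000):
    """Draw i.i.d. Bernoulli(q1) codewords until one is within Hamming distortion D of x.

    Returns (index of the first match, matching codeword).
    """
    n = len(x)
    for index in range(1, max_tries + 1):
        y = [1 if rng.random() < q1 else 0 for _ in range(n)]
        mismatches = sum(a != b for a, b in zip(x, y))
        if mismatches <= D * n + 1e-9:      # small slack guards against float rounding
            return index, y
    raise RuntimeError("no match found within max_tries")

def empirical_distribution(y):
    """First-order empirical distribution of a binary string."""
    ones = sum(y) / len(y)
    return {0: 1.0 - ones, 1: ones}
```

Running `first_match` on longer and longer source strings (with a seeded `random.Random`) and looking at `empirical_distribution` of the returned codeword gives a rough feel for the concentration the theorem describes, although the waiting time grows exponentially in the string length.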

As we will see below, Theorem 2 is a consequence of the following generalization of the conditional
limit theorem (see, e.g., jSJ Ch.12] for the standard form of the conditional limit theorem).

Theorem 3. Almost Sure Conditional Limit Theorem: Let $X_1^n$ and $Y_1^n$ be two independent random
vectors with distributions $P_n$ and $Q^n$, respectively, and write $\hat{P}_n = \hat{P}_{Y_1^n}$ for the empirical distribution
of $Y_1^n$. For every (measurable) $E \subset \hat{A}$, any $\delta > 0$, and for $P$-almost every realization $x_1^\infty$, as $n \to \infty$
we have:

$$\Pr\{|\hat{P}_n(E) - Q'(E)| > \delta \mid \rho_n(X_1^n, Y_1^n) \leq D \text{ and } X_1^n = x_1^n\} \to 0 \quad \text{exponentially fast}.$$

Note that since $\{\rho_n(X_1^n, Y_1^n) \leq D\}$ is a rare event, the conditional probability in Theorem 3 would
have been different without conditioning on a $P$-almost sure realization $x_1^\infty$. We first deduce Theorem 2
from Theorem 3 and then we give the proof of Theorem 3.

Proof of Theorem 2. Observe that

$$
\begin{aligned}
\Pr\{|\hat{Q}_n(E) - Q'(E)| > \delta \mid X_1^n\}
&\overset{(a)}{=} \sum_{k \geq 1} \Pr\{|\hat{Q}_n(E) - Q'(E)| > \delta \mid N_n = k \text{ and } X_1^n\}\, \Pr\{N_n = k \mid X_1^n\} \\
&\overset{(b)}{=} \sum_{k \geq 1} \Pr\{|\hat{P}_n(E) - Q'(E)| > \delta \mid \rho_n(X_1^n, Y_1^n) \leq D \text{ and } X_1^n\}\, \Pr\{N_n = k \mid X_1^n\} \\
&\overset{(c)}{=} \Pr\{|\hat{P}_n(E) - Q'(E)| > \delta \mid \rho_n(X_1^n, Y_1^n) \leq D \text{ and } X_1^n\},
\end{aligned}
$$

where (a) and (c) follow from the fact that $N_n < \infty$, eventually with probability one (by Theorem A);
and (b) follows from the observation that, due to the codewords' independence, the random variables
$\rho_n(X_1^n, Y_1^n(k))$, $k = 1, 2, \ldots$ are conditionally independent given $X_1^n$. This implies that, given $N_n$, the
distribution of the matching codeword is exactly the same as the distribution of $Y_1^n$ conditioned on
the event $\{\rho_n(X_1^n, Y_1^n) \leq D\}$.

This together with Theorem 3 proves (i). From (i) and the Borel-Cantelli lemma we conclude that,
for any measurable $E \subset \hat{A}$,

$$\hat{Q}_n(E) \to Q'(E), \quad \text{as } n \to \infty, \text{ w.p.1}.$$

Since $\hat{A}$ is a Polish space, there exists a countable convergence-determining class $\mathcal{E} = \{E_i\}$ of subsets of $\hat{A}$.$^2$

Therefore, with probability one we have that

$$\hat{Q}_n(E_i) \to Q'(E_i), \quad \text{as } n \to \infty, \text{ for all } i,$$

and this implies (ii). □
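Proof of Theorem 3. The probability in Theorem 3 can be expanded as

$$
\begin{aligned}
&\Pr\{|\hat{P}_n(E) - Q'(E)| > \delta \text{ and } \rho_n(X_1^n, Y_1^n) \leq D \mid X_1^n\} \,/\, \Pr\{\rho_n(X_1^n, Y_1^n) \leq D \mid X_1^n\} \\
&\quad = \Pr\{\hat{P}_n(E) < Q'(E) - \delta \text{ and } \rho_n(X_1^n, Y_1^n) \leq D \mid X_1^n\} \,/\, Q^n(B(X_1^n, D)) \\
&\qquad + \Pr\{\hat{P}_n(E) > Q'(E) + \delta \text{ and } \rho_n(X_1^n, Y_1^n) \leq D \mid X_1^n\} \,/\, Q^n(B(X_1^n, D)).
\end{aligned}
$$

We only treat the first of the two terms above; the second one can be dealt with similarly.

If $Q'(E) \leq \delta$ there is nothing to prove, so let us assume that $Q'(E) > \delta$. In view of Theorem B, it
suffices to show that

$$\limsup_{n \to \infty} \frac{1}{n}\log \Pr\{\hat{P}_n(E) < Q'(E) - \delta \text{ and } \rho_n(X_1^n, Y_1^n) \leq D \mid X_1^n\} < -R(P,Q,D) \quad \text{w.p.1}. \tag{21}$$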

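This will be proved by an application of the Gärtner-Ellis theorem. Toward that end, choose and fix
an arbitrary realization $x_1^\infty$ of $X$, and define a sequence of random vectors $\{\xi_n\}$ in $\mathbb{R}^2$ as

$$\xi_n = (\rho_n(x_1^n, Y_1^n),\, \hat{P}_n(E)) = \frac{1}{n}\sum_{i=1}^n (\rho(x_i, Y_i),\, I_E(Y_i)),$$

where the random variables $\{Y_i\}$ are as in the statement of the theorem. Let $\Theta_n(\lambda)$ denote the
log-moment generating function of $\xi_n$,

$$\Theta_n(\lambda) = \ln E_{Q^n}\left[\exp\{\lambda_1 \rho_n(x_1^n, Y_1^n) + \lambda_2 \hat{P}_n(E)\}\right], \quad \lambda = (\lambda_1, \lambda_2) \in (-\infty, 0]^2.$$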
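By the ergodic theorem we have, for $P$-almost every realization $x_1^\infty$,

$$
\lim_{n \to \infty} \frac{1}{n}\Theta_n(n\lambda) = \lim_{n \to \infty} \frac{1}{n}\sum_{i=1}^n \ln E_Q\left[\exp\{\lambda_1 \rho(x_i, Y) + \lambda_2 I_E(Y)\}\right]
= \Theta(\lambda) \triangleq E_P\left\{\ln E_Q\left[\exp\{\lambda_1 \rho(X,Y) + \lambda_2 I_E(Y)\}\right]\right\}
\tag{22}
$$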



where, by Jensen's inequality, the limiting log-moment generating function $\Theta(\lambda)$ satisfies $-\infty <
(\lambda_1 D_{\rm av} + \lambda_2) \leq \Theta(\lambda) \leq 0$. It is easy to check (using the dominated convergence theorem) that $\Theta(\lambda)$
is differentiable, with partial derivatives

$$\frac{\partial \Theta}{\partial \lambda_1} = E_{W^{(\lambda)}}[\rho(X,Y)], \qquad \frac{\partial \Theta}{\partial \lambda_2} = W_Y^{(\lambda)}(E),$$

where $W^{(\lambda)}$ is the probability measure defined on $A \times \hat{A}$ by

$$\frac{dW^{(\lambda)}}{d(P \times Q)}(x,y) \triangleq \frac{\exp\{\lambda_1 \rho(x,y) + \lambda_2 I_E(y)\}}{E_Q[\exp\{\lambda_1 \rho(x, Y) + \lambda_2 I_E(Y)\}]},$$

and $W_Y^{(\lambda)}$ is the $Y$-marginal of $W^{(\lambda)}$.

$^2$ For example, let $B$ be a countable dense subset of $\hat{A}$, and take $\mathcal{E}$ to be the collection of all open balls with rational
radii centered at the points of $B$, together with all finite intersections of such balls. Then $\mathcal{E}$ is countable (by construction)
and it is easy to check that [5, Theorem 2.3] applies, verifying that $\mathcal{E}$ is convergence-determining.

Note that $\Theta(\lambda)$ is a convex function, and define its convex dual,

$$\Theta^*(z) = \sup_{\lambda_1 \leq 0,\, \lambda_2 \leq 0}\,[(z, \lambda) - \Theta(\lambda)], \quad z = (z_1, z_2) \in \mathbb{R}^2,$$

where $(z, \lambda)$ denotes the usual Euclidean inner product $(z, \lambda) = z_1\lambda_1 + z_2\lambda_2$.

In view of (22) and of the above discussion, we can apply the Gärtner-Ellis theorem [UJ Theorem 2.3.6]
to conclude that, with probability one, the probabilities in (21) satisfy

$$\limsup_{n \to \infty} \frac{1}{n}\ln \Pr\{\hat{P}_n(E) < Q'(E) - \delta \text{ and } \rho_n(X_1^n, Y_1^n) \leq D \mid X_1^n\} \leq -\inf_{z_1 \in [0,D],\, z_2 \in [0,q]} \Theta^*(z_1, z_2),$$

where we write $q = Q'(E) - \delta > 0$. From its definition, it is obvious that
$\Theta^*(z_1, z_2)$ is nonincreasing in each of its coordinates. Therefore, to prove (21) and conclude the proof
of the theorem it suffices to show (recall (19)) that:

$$\Theta^*(D, q) > (\ln 2)\,R(P, Q, D) = \lambda^* D - \Lambda(\lambda^*). \tag{23}$$

To prove (23) we consider

$$g(\lambda_1, \lambda_2) = \lambda_1 D + \lambda_2 q - \Theta(\lambda_1, \lambda_2).$$

Using the dominated convergence theorem as before we can differentiate $g$ with respect to $\lambda_2$ to get
that for all $(\lambda_1, \lambda_2) \in (-\infty, 0]^2$,

$$\frac{\partial g(\lambda_1, \lambda_2)}{\partial \lambda_2} = q - E_P\left\{\frac{\int_E \exp\{\lambda_1 \rho(X, y) + \lambda_2 I_E(y)\}\, dQ(y)}{E_Q[\exp\{\lambda_1 \rho(X, Y') + \lambda_2 I_E(Y')\}]}\right\} = q - W_Y^{(\lambda)}(E),$$

where at the endpoint $\lambda_2 = 0$ this is understood as the corresponding right-derivative. Now, if this
derivative evaluated at $(\lambda_1, \lambda_2) = (\lambda^*, 0)$ is nonnegative, i.e., if

$$W_Y^{(\lambda^*,0)}(E) \leq q < Q'(E), \tag{24}$$

then, after some simple algebra, $W^{(\lambda^*,0)}$ is easily seen to satisfy

$$(\ln 2)\,H(W^{(\lambda^*,0)}\|P \times Q) = \lambda^* D - \Lambda(\lambda^*) = (\ln 2)\,R(P, Q, D),$$

and also

$$E_{W^{(\lambda^*,0)}}[\rho(X,Y)] = \Lambda'(\lambda^*) = D$$

(see (20)). Since as we remarked above $R(P,Q,D)$ is uniquely achieved by $W^*$, we must have that
$W^{(\lambda^*,0)} = W^*$ and, in particular, $W_Y^{(\lambda^*,0)} = W^*_Y = Q'$. But this contradicts (24).
Therefore, it must be the case that

$$\left.\frac{\partial g(\lambda_1, \lambda_2)}{\partial \lambda_2}\right|_{(\lambda^*, 0)} < 0,$$

which means that by taking $\lambda_2 = \lambda'$ slightly negative, we can make $g(\lambda^*, \lambda')$ strictly larger than
$g(\lambda^*, 0) = (\ln 2)\,R(P,Q,D)$. Hence

$$\Theta^*(D, q) = \sup_{\lambda_1, \lambda_2} g(\lambda_1, \lambda_2) \geq g(\lambda^*, \lambda') > (\ln 2)\,R(P, Q, D),$$

establishing (23) and thereby completing the proof. □



5 Entropy-Coding Performance

Before giving the proof of Theorem 1 we need to state four simple Lemmas that establish some of the 
technical properties we need in the proof. 



5.1 Four Lemmas 

Recall the notation and assumptions of Section 4.1. We begin with some preliminary lemmas.
Lemma 1:

$$\lim_{\delta \downarrow 0}\, \sup_{\{F_i\}}\, \inf_{\tilde{Q}:\, |\tilde{Q} - Q'| < \delta} H(\tilde{Q}\|Q) = H(Q'\|Q),$$

where the supremum is taken over all finite partitions $\{F_1, F_2, \ldots, F_k\}$ of $\hat{A}$, and for any such partition
the infimum is over all probability measures $\tilde{Q}$ on $\hat{A}$ such that $|\tilde{Q}(F_i) - Q'(F_i)| < \delta$ for all $i$.

Proof. Since $Q'$ is always among the measures over which the infimum is taken, we obviously have
that the above left-hand side is no larger than the right-hand side.

To prove the corresponding lower bound let $\epsilon > 0$ be arbitrary, and choose a finite partition $\mathcal{P} =
\{F_1, \ldots, F_k\}$ such that $H(Q'_{\mathcal{P}}\|Q_{\mathcal{P}}) \geq H(Q'\|Q) - \epsilon$, where for any measure $\mu$ and any partition $\mathcal{P}$
we write $\mu_{\mathcal{P}}$ for the corresponding discrete measure which assigns probability $\mu(F_i)$ to each $i$ in the
alphabet $\{1, 2, \ldots, k\}$. The fact that this is possible follows from [25, Chapter 2] and the fact that
$H(Q'\|Q) \leq R(P, Q, D) < \infty$. Without loss of generality we assume that $Q(F_i) \neq 0$ for all $F_i \in \mathcal{P}$.

By the uniform continuity of relative entropy on a finite alphabet, we can choose $\delta_0 > 0$ small
enough so that

$$H(\tilde{Q}_{\mathcal{P}}\|Q_{\mathcal{P}}) \geq H(Q'_{\mathcal{P}}\|Q_{\mathcal{P}}) - \epsilon$$

for all probability measures $\tilde{Q}_{\mathcal{P}}$ on $\{1, \ldots, k\}$ with $|\tilde{Q}_{\mathcal{P}}(F_i) - Q'_{\mathcal{P}}(F_i)| < \delta_0$ for all $i$. Then by the data
processing inequality for relative entropy, for all $\delta < \delta_0$,

$$\inf_{\tilde{Q}:\, |\tilde{Q} - Q'| < \delta} H(\tilde{Q}\|Q) \geq \inf_{\tilde{Q}:\, |\tilde{Q} - Q'| < \delta_0} H(\tilde{Q}_{\mathcal{P}}\|Q_{\mathcal{P}}) \geq H(Q'_{\mathcal{P}}\|Q_{\mathcal{P}}) - \epsilon \geq H(Q'\|Q) - 2\epsilon,$$

and since $\epsilon > 0$ was arbitrary, we are done. □

The following lemma contains a simple observation based on Lemma 1; it is stated here without
proof.

Lemma 2: For any $\epsilon > 0$ there is a $\delta > 0$ and a finite partition $\mathcal{P} = \{F_1, \ldots, F_k\}$ of $\hat{A}$ such that

$$R(\delta) \triangleq \inf_{\tilde{Q}:\, |\tilde{Q} - Q'| < \delta} H(\tilde{Q}\|Q) \geq H(Q'\|Q) - \epsilon.$$

Writing $I(\delta) = R(P, Q, D) - R(\delta)$, we have

$$I_m(P\|Q', D) \leq I(\delta) \leq I_m(P\|Q', D) + \epsilon.$$

Given a $\delta > 0$ and a finite partition $\mathcal{P} = \{F_1, \ldots, F_k\}$, let $B_\delta$ denote the set of probability measures
on $\hat{A}$ that are $\delta$-close to $Q'$ on the sets $F_i$:

$$B_\delta = \{\tilde{Q} : |\tilde{Q}(F_i) - Q'(F_i)| < \delta \text{ for all } i\}. \tag{25}$$
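The role of the finite partitions in Lemmas 1 and 2 rests on the data processing inequality for relative entropy: merging alphabet symbols into cells can only decrease $H(\cdot\|\cdot)$. The following sketch illustrates this on a small discrete alphabet; it assumes finitely supported distributions given as lists, with every entry of the second argument strictly positive, and the helper names are hypothetical.

```python
import math

def relative_entropy_bits(p, q):
    """H(p || q) in bits for distributions on a common finite alphabet (q > 0 everywhere)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def coarsen(p, partition):
    """Push a distribution forward onto the cells of a partition (list of index lists)."""
    return [sum(p[i] for i in cell) for cell in partition]
```

For example, with `p = [0.5, 0.25, 0.125, 0.125]`, uniform `q`, and the two-cell partition `[[0, 1], [2, 3]]`, the coarsened relative entropy is smaller than the original one, while the finest partition (singletons) recovers it exactly.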

Lemma 3. Let $X_1^n$ and $Y_1^n$ be two independent random vectors with distributions $P_n$ and $Q^n$,
respectively, and write $\hat{P}_n$ for the empirical distribution of $Y_1^n$. Suppose a $\delta > 0$ and a finite partition
$\mathcal{P} = \{F_1, \ldots, F_k\}$ of $\hat{A}$ are given, and write

$$p_n = p_n(x_1^n, \delta) = \Pr\{\rho_n(x_1^n, Y_1^n) \leq D \mid \hat{P}_n \in B_\delta\}.$$

Then we have:

$$\lim_{n \to \infty} -\frac{1}{n}\log p_n(X_1^n, \delta) = I(\delta) \quad \text{w.p.1}.$$

Proof. We expand

$$p_n(X_1^n, \delta) = \Pr\{\rho_n(X_1^n, Y_1^n) \leq D \text{ and } \hat{P}_n \in B_\delta\} \,/\, \Pr\{\hat{P}_n \in B_\delta\}$$

and evaluate the exponential behavior of the numerator and denominator separately. First, by Sanov's
theorem [7],

$$\lim_{n \to \infty} \frac{1}{n}\log \Pr\{\hat{P}_n \in B_\delta\} = -R(\delta). \tag{26}$$

We will also show that

$$\lim_{n \to \infty} \frac{1}{n}\log \Pr\{\rho_n(X_1^n, Y_1^n) \leq D \text{ and } \hat{P}_n \in B_\delta\} = -R(P,Q,D) \quad \text{w.p.1}, \tag{27}$$

and, recalling that $R(P, Q, D) - R(\delta) = I(\delta)$, this will complete the proof.
First note that, since

$$\Pr\{\rho_n(X_1^n, Y_1^n) \leq D \text{ and } \hat{P}_n \in B_\delta\} \leq Q^n(B(X_1^n, D)),$$

Theorem B implies

$$\limsup_{n \to \infty} \frac{1}{n}\log \Pr\{\rho_n(X_1^n, Y_1^n) \leq D \text{ and } \hat{P}_n \in B_\delta\} \leq -R(P, Q, D) \quad \text{w.p.1}. \tag{28}$$
For the corresponding lower bound we employ the Gärtner-Ellis theorem, much as in the proof of
Theorem 3. Let $x_1^\infty$ be some fixed realization of $X$, and define a sequence of random vectors $\{\zeta_n\}$ in
$\mathbb{R}^{k+1}$ by

$$\zeta_n = (\rho_n(x_1^n, Y_1^n),\, \hat{P}_n(F_1), \ldots, \hat{P}_n(F_k)) = \frac{1}{n}\sum_{i=1}^n (\rho(x_i, Y_i),\, I_{F_1}(Y_i), \ldots, I_{F_k}(Y_i)).$$

Let $\Gamma_n(\lambda)$ denote the log-moment generating function of $\zeta_n$,

$$\Gamma_n(\lambda) = \ln E_{Q^n}\left[\exp\left\{\lambda_0 \rho_n(x_1^n, Y_1^n) + \sum_{i=1}^k \lambda_i \hat{P}_n(F_i)\right\}\right], \quad \lambda = (\lambda_0, \ldots, \lambda_k) \in (-\infty, 0]^{k+1}.$$

As before, the ergodic theorem says that for $P$-almost every realization $x_1^\infty$,

$$\lim_{n \to \infty} \frac{1}{n}\Gamma_n(n\lambda) = \Gamma(\lambda) \triangleq E_P\left\{\ln E_Q\left[\exp\left\{\lambda_0 \rho(X,Y) + \sum_{i=1}^k \lambda_i I_{F_i}(Y)\right\}\right]\right\}$$

where, by Jensen's inequality, the limiting log-moment generating function $\Gamma(\lambda)$ satisfies $-\infty < (\lambda_0 D_{\rm av} +
\sum_{i=1}^k \lambda_i) \leq \Gamma(\lambda) \leq 0$. Once again, a routine application of the dominated convergence theorem verifies
that $\Gamma(\lambda)$ is differentiable, so the Gärtner-Ellis theorem [§1 Theorem 2.3.6] yields that

$$\liminf_{n \to \infty} \frac{1}{n}\ln \Pr\{\rho_n(X_1^n, Y_1^n) \leq D \text{ and } \hat{P}_n \in B_\delta\} \geq -\inf_z \Gamma^*(z) \quad \text{w.p.1},$$

where the infimum is taken over all $z \in \mathbb{R}^{k+1}$ with $z_0 \in [0, D)$ and $|z_i - Q'(F_i)| < \delta$, $i = 1, \ldots, k$, and
$\Gamma^*$ denotes the convex dual of $\Gamma$,

$$\Gamma^*(z) \triangleq \sup_{\lambda \in (-\infty, 0]^{k+1}}\,[(z, \lambda) - \Gamma(\lambda)], \quad z \in \mathbb{R}^{k+1},$$

where $(z, \lambda)$ is the Euclidean inner product in $\mathbb{R}^{k+1}$, $(z, \lambda) = \sum_{i=0}^k z_i\lambda_i$. Therefore, with probability
one,

$$\liminf_{n \to \infty} \frac{1}{n}\log \Pr\{\rho_n(X_1^n, Y_1^n) \leq D \text{ and } \hat{P}_n \in B_\delta\} \geq -(\log e)\,\Gamma^*(D, Q'(F_1), \ldots, Q'(F_k)), \tag{29}$$

where we used the (easily verifiable) fact that $\Gamma^*$ is continuous in $z_0 \in (D_{\min}, D_{\rm av})$. Finally we claim
that

$$(\log e)\,\Gamma^*(D, Q'(F_1), \ldots, Q'(F_k)) \leq R(P,Q,D). \tag{30}$$

Combining (30) with (29) and with the upper bound (28) proves (27) and the Lemma.

So it only remains to establish (30). Note that for any $x \in A$, any measurable function $\phi : \hat{A} \to \mathbb{R}$
which is bounded above, and any measure $W$ on $A \times \hat{A}$,

$$(\ln 2)\,H(W(\cdot|x)\|Q(\cdot)) \geq \int \phi(y)\,W(dy|x) - \ln E_Q\left[e^{\phi(Y)}\right]. \tag{31}$$

[This can be proved in exactly the same way as the corresponding statement in the proof of [SJ
Theorem 2].] Take $W = W^*$ to be the achieving measure in the definition of $R(P,Q,D)$, and let
$\phi(y) = \lambda_0 \rho(x,y) + \sum_{i=1}^k \lambda_i I_{F_i}(y)$ for some $\lambda \in (-\infty, 0]^{k+1}$. Applying (31) and integrating both sides
with respect to $P$ we get that

$$
\begin{aligned}
R(P,Q,D) &\geq (\log e)\left[\lambda_0 E_{W^*}[\rho(X, Y)] + \sum_{i=1}^k \lambda_i Q'(F_i) - \Gamma(\lambda)\right] \\
&\geq (\log e)\left[(\lambda, (D, Q'(F_1), \ldots, Q'(F_k))) - \Gamma(\lambda)\right],
\end{aligned}
$$

and since this holds for any $\lambda \in (-\infty, 0]^{k+1}$ we have established (30), as required. □

Finally we give a simple general result on the asymptotic behavior of the entropy of sequences of
random variables. Its proof is in Appendix B.

Lemma 4: Let $\xi_1, \xi_2, \ldots$ be a sequence of random variables, and $A_1, A_2, \ldots$ be a sequence of events
with $\Pr\{A_n\} \to 1$, as $n \to \infty$. Assume that $\xi_n \in \{1, 2, 3, \ldots, 2^{n\beta}\}$, for all $n$ and some $\beta < \infty$. Then,

$$\lim_{n \to \infty} \frac{1}{n}\left[H(\xi_n) - H(\xi_n \mid I_{A_n} = 1)\right] = 0.$$



5.2 Proof of Theorem 1 



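Let $\epsilon > 0$ be arbitrary, and choose a $\delta > 0$ and a finite partition $\mathcal{P} = \{F_1, \ldots, F_k\}$ of $\hat{A}$ as in Lemma 2.
With $B_\delta$ as in (25) and with $\hat{P}_{Y_1^n(k)}$ denoting the empirical distribution of the $k$th codeword, for
$k = 1, 2, \ldots$ we define:

$$
J_k =
\begin{cases}
I_{\{\hat{P}_{Y_1^n(k)} \in B_\delta\}} & \text{if } 1 \leq k \leq \lfloor 2^{nb} \rfloor \\
1 & \text{if } k \geq \lfloor 2^{nb} \rfloor + 1.
\end{cases}
$$

Now we consider two sub-codebooks of $C_n$,

$$
\begin{aligned}
C_n^{(0)} &= \{Y_1^n(k) : J_k = 0,\ 1 \leq k \leq \lfloor 2^{nb} \rfloor\} \\
C_n^{(1)} &= \{Y_1^n(k) : J_k = 1,\ 1 \leq k \leq \lfloor 2^{nb} \rfloor\}.
\end{aligned}
$$

Also, for $j = 0, 1$, let $N_n^{(j)}$ be the index of the first codeword in $C_n^{(j)}$ that matches $X_1^n$ with distortion
$D$ or less, and let $M_n^{(j)}$ be the index of the position of $Y_1^n(N_n^{(j)})$ in $C_n^{(j)}$. If no match is found in $C_n^{(j)}$,
then let $N_n^{(j)} = M_n^{(j)} = \lfloor 2^{nb} \rfloor + 1$. From these definitions it immediately follows that, given $C_n$, the
value of $N'_n$ and the values of $(M_n^{(J_{N_n})}, J_{N_n})$ are in a one-to-one correspondence.

To bound $E[\ell_n(X_1^n)]$ we begin by expanding

$$
\begin{aligned}
H(N'_n \mid C_n) &= H(M_n^{(J_{N_n})}, J_{N_n} \mid C_n) \\
&\leq 1 + H(M_n^{(J_{N_n})} \mid J_{N_n}) \\
&= 1 + \Pr\{J_{N_n} = 0\}\,H(M_n^{(J_{N_n})} \mid J_{N_n} = 0) + \Pr\{J_{N_n} = 1\}\,H(M_n^{(J_{N_n})} \mid J_{N_n} = 1) \\
&\leq 1 + \Pr\{J_{N_n} = 0\}\log(\lfloor 2^{nb} \rfloor + 1) + H(M_n^{(1)} \mid J_{N_n} = 1),
\end{aligned}
$$

therefore, in view of (7)

$$
\begin{aligned}
\limsup_{n \to \infty} \frac{1}{n}E[\ell_n(X_1^n)] &\leq \limsup_{n \to \infty} \frac{1}{n}H(N'_n \mid C_n) \\
&\leq \limsup_{n \to \infty} \left[\frac{1}{n}H(M_n^{(1)} \mid J_{N_n} = 1) + \frac{1}{n}\log(2^{nb} + 1)\Pr\{\hat{Q}_n \notin B_\delta\}\right] \\
&\overset{(a)}{=} \limsup_{n \to \infty} \frac{1}{n}H(M_n^{(1)} \mid J_{N_n} = 1) \\
&\overset{(b)}{=} \limsup_{n \to \infty} \frac{1}{n}H(M_n^{(1)}),
\end{aligned}
\tag{32}
$$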

where (a) follows from Theorem 2, and (b) follows from Lemma 4. Now recall the definition of the
(conditional) probability $p_n = p_n(X_1^n, \delta)$ from Lemma 3, and for arbitrary $\Delta > 0$ let $E_n^{(\Delta)}$ denote a
quantized version of the exponent of $p_n$:

$$E_n^{(\Delta)} = \Delta \left\lceil -\frac{1}{n\Delta}\log p_n \right\rceil.$$

Note that, given a source string $x_1^n$, the random variable $M_n^{(1)}$ has a "truncated Geometric" distribution,
which we denote by Geom$^*(p_n)$; formally, for a parameter $q \in (0, 1)$,

$$
\Pr\{\text{Geom}^*(q) = k\} =
\begin{cases}
q(1 - q)^{k-1} & \text{if } 1 \leq k \leq \lfloor 2^{nb} \rfloor \\
(1 - q)^{k-1} & \text{if } k = 1 + \lfloor 2^{nb} \rfloor \\
0 & \text{otherwise}.
\end{cases}
$$

A useful bound on the entropy of a mixture of Geom$^*(q)$ distributions is given in the following lemma;
its proof is given in Appendix C.

Lemma 5: If the distribution of the random variable $Z$ is a mixture of Geom$^*(q)$ distributions with
parameters $q \geq a$, then $H(Z) \leq \log(e/a)$.
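The truncated geometric distribution and the entropy bound are easy to check numerically in the simplest case of a single component (a mixture with one term). In the sketch below the cap `K` plays the role of $\lfloor 2^{nb} \rfloor$; the function names are illustrative.

```python
import math

def truncated_geometric_pmf(q: float, K: int):
    """pmf of Geom*(q) on {1, ..., K + 1}, returned as a list indexed from k = 1."""
    pmf = [q * (1.0 - q) ** (k - 1) for k in range(1, K + 1)]
    pmf.append((1.0 - q) ** K)   # mass at K + 1: no match among the first K codewords
    return pmf

def entropy_bits(pmf):
    """Shannon entropy in bits of a finite pmf."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)
```

For each fixed $q$ the entropy stays below $\log_2(e/q)$ regardless of the cap, which is the single-component instance of the bound in Lemma 5; the lemma itself covers arbitrary mixtures and is not verified by this check.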

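Now observe that, given $E_n^{(\Delta)}$ is equal to some $e_n$, the conditional distribution of $M_n^{(1)}$ is a mixture
of Geom$^*(q)$ distributions, for parameter values $q \geq 2^{-ne_n}$. Therefore, by Lemma 5,

$$H(M_n^{(1)} \mid E_n^{(\Delta)} = e_n) \leq ne_n + \log(e),$$

and hence, with probability one,

$$
\begin{aligned}
\limsup_{n \to \infty} \frac{1}{n}H(M_n^{(1)} \mid E_n^{(\Delta)} = e_n) &\leq \limsup_{n \to \infty} E_n^{(\Delta)} \\
&\leq \limsup_{n \to \infty} \left[-\frac{1}{n}\log p_n(X_1^n, \delta)\right] + \Delta \\
&\overset{(a)}{\leq} I(\delta) + \Delta \\
&\overset{(b)}{\leq} I_m(P\|Q', D) + \epsilon + \Delta,
\end{aligned}
$$

where (a) follows by Lemma 3 and (b) by Lemma 2. Since for all $n$ large enough $(1/n)H(M_n^{(1)} \mid E_n^{(\Delta)} =
e_n) \leq (b + 1/n) \leq (b + 1)$ with probability one, we can apply Fatou's lemma to get,

$$
\begin{aligned}
\limsup_{n \to \infty} \frac{1}{n}H(M_n^{(1)} \mid E_n^{(\Delta)}) &= \limsup_{n \to \infty} E\left[\frac{1}{n}H(M_n^{(1)} \mid E_n^{(\Delta)} = e_n)\right] \\
&\leq E\left[\limsup_{n \to \infty} \frac{1}{n}H(M_n^{(1)} \mid E_n^{(\Delta)} = e_n)\right] \\
&\leq I_m(P\|Q', D) + \epsilon + \Delta.
\end{aligned}
\tag{33}
$$

Next we will show that

$$\limsup_{n \to \infty} \frac{1}{n}H(M_n^{(1)}) \leq \limsup_{n \to \infty} \frac{1}{n}H(M_n^{(1)} \mid E_n^{(\Delta)}). \tag{34}$$

Since $\epsilon > 0$ and $\Delta > 0$ were arbitrary, combining this with (32) and (33) will complete the proof of
the theorem.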

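Turning to the proof of (34), we take $\epsilon' > 0$ arbitrary, and define

$$I_n = I_{\{E_n^{(\Delta)} \leq I_m(P\|Q', D) + \Delta + \epsilon + \epsilon'\}}, \tag{35}$$

and observe that, by Lemmas 2 and 3,

$$\Pr\{I_n = 1\} \geq \Pr\left\{-\frac{1}{n}\log p_n(X_1^n, \delta) \leq I(\delta) + \epsilon'\right\} \to 1 \quad \text{as } n \to \infty. \tag{36}$$

We can expand

$$
\begin{aligned}
H(M_n^{(1)} \mid E_n^{(\Delta)}) &\geq H(M_n^{(1)} \mid I_n, E_n^{(\Delta)}) \\
&\geq \Pr\{I_n = 1\}\,H(M_n^{(1)} \mid I_n = 1, E_n^{(\Delta)}) \\
&= \Pr\{I_n = 1\}\left[H(M_n^{(1)}, E_n^{(\Delta)} \mid I_n = 1) - H(E_n^{(\Delta)} \mid I_n = 1)\right] \\
&\overset{(a)}{\geq} \Pr\{I_n = 1\}\left[H(M_n^{(1)}, E_n^{(\Delta)} \mid I_n = 1) - K\right] \\
&\geq H(M_n^{(1)} \mid I_n = 1) + (\Pr\{I_n = 1\} - 1)\log(\lfloor 2^{nb} \rfloor + 1) - K\Pr\{I_n = 1\},
\end{aligned}
$$

where

$$K \triangleq \log\left(\left\lceil \frac{I_m(P\|Q', D) + \Delta + \epsilon + \epsilon'}{\Delta} \right\rceil + 1\right),$$

and where (a) follows since by (35) the number of values of $E_n^{(\Delta)}$ is at most $2^K$. Therefore, from this
and (36) we have

$$\limsup_{n \to \infty} \frac{1}{n}H(M_n^{(1)} \mid I_n = 1) \leq \limsup_{n \to \infty} \frac{1}{n}H(M_n^{(1)} \mid E_n^{(\Delta)}).$$

Finally, (36) allows us to apply Lemma 4 and thus conclude that (34) holds, completing the proof of
the theorem. □

6 Tighter Bounds for Sources with Memory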



In this section we discuss the following questions: (1) Under what conditions is the bound in Theorem 1 
tight? (2) When it is not tight, what is the actual performance of the entropy-coded scheme? Only 
heuristic arguments and proof outlines are given. 

To gain some intuition, we first consider the extreme case of lossless compression of a finite-alphabet,
stationary ergodic source $X$, that is, $D = 0$ relative to Hamming distortion. Let $Q$ be a codebook
distribution on $\hat{A} = A$ with $Q(a) > 0$ for all $a \in A$, and let $C_n$ be a memoryless random
codebook with distribution $Q$. Then all possible $n$-strings from $A^n$ will appear infinitely often in
$C_n$, and the matching codeword will always be identical to the source string. Moreover, this also
implies that $Q^*_{P_1,Q,D}$, the limiting first-order empirical distribution of the matching codeword
(Theorem 2), will simply be the first-order marginal $P_1$ of the source. It is therefore an immediate
consequence of the AEP (Asymptotic Equipartition Property [5]) that the asymptotic rate achieved by this
scheme will be exactly equal to the entropy rate $H(X)$ of the source $X$. In this case it is easy to
calculate the bound given in Theorem 1 explicitly to get that, at $D = 0$,

$$I_m(P_1 \| Q^*_{P_1,Q,D}, D) = I_m(P_1 \| P_1, D) = H(X_1).$$

Since $H(X) \le H(X_1)$, with equality exactly when the source is memoryless, the above argument
indicates that the bound in Theorem 1 will be tight if and only if the source $X$ is memoryless. Indeed,
for finite-alphabet memoryless sources this was shown to be the case in I3H-.

Now let us turn to general alphabet sources and positive distortions $D > 0$. For general stationary
sources, it is well-known that the rate-distortion function decreases as the memory increases, so it is
natural to expect that the rate achieved by any "good" coding scheme will also take advantage of such
dependencies.

For the naive coding scheme Theorem A immediately shows that, if the codebook distribution is
memoryless, then memory in the source does not affect the rate achieved. Formally, this observation
is reflected in the identity [2"§]

$$R(P_k, Q^k, kD) = k\,R(P_1, Q, D), \quad \text{for all } k. \tag{37}$$

In contrast, in the entropy-coded case we expect that memory in the source does affect the rate. For
example, the above heuristic argument shows that for $D = 0$ entropy-coding achieves the entropy rate
of the source, and not just $H(X_1)$. But since the bound $I_m(P_1 \| Q^*_{P_1,Q,D}, D)$ in Theorem 1 only
depends on the first-order marginal $P_1$, memory in the source does not affect it and therefore it
cannot be tight in this case. As we discuss next, we can establish tighter bounds showing that, in fact,
entropy-coding the index does take advantage of memory. This more desirable behavior is reflected in
the multi-dimensional behavior of the lower mutual information (LMI) function: In contrast to (37),
whenever $P_k \neq P_1^k$,

$$I_m(P_k \| Q^*_{P_k,Q^k,kD}, kD) < k\, I_m(P_1 \| Q^*_{P_1,Q,D}, D),$$

where $Q^*_{P_k,Q^k,kD}$ is the (unique) $k$-dimensional distribution that achieves $R(P_k, Q^k, kD)$.
Therefore, the LMI decreases due to memory in the source even if the codebook is memoryless. 
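
The gap between the first-order bound and the entropy rate in the lossless case discussed above can be
illustrated numerically. The following is a minimal sketch, assuming a two-state symmetric Markov source
chosen purely for illustration (the transition matrix `T` and the helper names are hypothetical and do
not come from the paper). It compares $H(X_1)$, the value of the Theorem 1 bound at $D = 0$, with the
entropy rate $H(X)$.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy (bits) of a probability vector, ignoring zero entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def markov_entropies(T):
    """Return (H(X_1), entropy rate) in bits for a stationary ergodic Markov chain."""
    T = np.asarray(T, dtype=float)
    # Stationary distribution: left eigenvector of T for the eigenvalue 1.
    vals, vecs = np.linalg.eig(T.T)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    pi = pi / pi.sum()
    h_marginal = entropy_bits(pi)
    # Entropy rate of a Markov chain: sum_i pi_i * H(T[i, :]).
    h_rate = float(sum(pi[i] * entropy_bits(T[i]) for i in range(len(pi))))
    return h_marginal, h_rate

# Illustrative source (assumption): symmetric chain that flips state with probability 0.1.
T = [[0.9, 0.1], [0.1, 0.9]]
h1, hrate = markov_entropies(T)
print(f"H(X_1) = {h1:.4f} bits, entropy rate = {hrate:.4f} bits")
```

For this chain the marginal is uniform, so the first-order bound is one bit per symbol, while the entropy
rate is the binary entropy of 0.1, which is less than half a bit.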

Example. Universal Gaussian Codebooks. To appreciate this decrease in LMI due to memory
in the source, consider a memoryless Gaussian codebook with large variance $\tau^2$ and squared error
distortion measure, as in Section 3.1. For a real-valued source $X$ with zero mean and finite variance
$\sigma^2$, a straightforward $k$-dimensional extension of Proposition 1 gives,

$$\lim_{\tau^2 \to \infty} I_m(P_k \| Q^*_{P_k,Q^k,kD}, kD) = I(X_1^k; X_1^k + Z_1^k) \tag{38}$$

where $X_1^k \sim P_k$, and $Z_1^k$ denotes an i.i.d. $N(0,D)$ random vector independent of $X_1^k$. Since
$Z_1^k$ has a density we can write

$$I(X_1^k; X_1^k + Z_1^k) = h(X_1^k + Z_1^k) - k\,h(Z_1).$$

If $X_1^k$ also has a density, then for small $D$ this expression becomes $h(X_1^k) - k\,h(Z_1) + o(1)$,
where $o(1) \to 0$ as $D \to 0$. It follows that, for small $D$,

$$I_m(P_1 \| Q^*_{P_1,Q,D}, D) - \frac{1}{k} I_m(P_k \| Q^*_{P_k,Q^k,kD}, kD) = h(P_1) - \frac{1}{k} h(P_k) + o(1)$$

where $\lim_{D \to 0} \lim_{\tau^2 \to \infty} o(1) = 0$. That is, for small $D$ the LMI rate reduction
relative to the marginal case is asymptotically

$$h(X_1) - \lim_{k \to \infty} \frac{1}{k} h(X_1^k) = I(X_1; X_{-\infty}^{0})$$

as $k \to \infty$. This is the information the past has about the present, which for some sources can be
very large. 
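
To give a feel for the size of the quantity $h(P_1) - \frac{1}{k} h(P_k)$, the following is a minimal
sketch assuming a unit-variance stationary Gaussian AR(1) source with correlation coefficient `a`. This
source and the helper names are illustrative assumptions, not part of the paper. For such a source the
$k$-dimensional differential entropy is available in closed form from the covariance determinant, and
the limit as $k \to \infty$ equals $-\frac{1}{2}\ln(1-a^2)$ nats.

```python
import numpy as np

def gaussian_diff_entropy(cov):
    """Differential entropy (nats) of N(0, cov): 0.5 * log det(2*pi*e*cov)."""
    cov = np.atleast_2d(np.asarray(cov, dtype=float))
    k = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (k * np.log(2 * np.pi * np.e) + logdet)

def ar1_cov(k, a):
    """Covariance of k consecutive samples of a unit-variance Gaussian AR(1) process."""
    idx = np.arange(k)
    return a ** np.abs(idx[:, None] - idx[None, :])

def memory_gain(k, a):
    """h(X_1) - (1/k) h(X_1^k) in nats, for the illustrative AR(1) source."""
    return gaussian_diff_entropy(ar1_cov(1, a)) - gaussian_diff_entropy(ar1_cov(k, a)) / k

a = 0.9                                   # illustrative correlation (assumption)
block_sizes = (1, 2, 10, 200)
gains = [memory_gain(k, a) for k in block_sizes]
limit = -0.5 * np.log(1 - a**2)           # value approached as k grows
for k, g in zip(block_sizes, gains):
    print(f"k = {k:4d}: gain = {g:.4f} nats (limit {limit:.4f})")
```

The gain is zero at $k = 1$ and grows with the block length, approaching the limit from below; the
stronger the correlation, the larger the limit.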

In general, the tighter bounds on the rate of the entropy-coded scheme follow from natural
$k$-dimensional extensions of the results in Theorems 1 and 2. As before, we restrict attention to
memoryless random codebooks with arbitrary distribution $Q$, single-letter distortion measures, and
stationary ergodic sources. As in [8], the extension to the case with memory follows by considering
$k$-blocks of super-symbols in the source and the codebook, but the technicalities, although not
particularly insightful, are very involved. The reader will have probably been convinced of this by
seeing the proofs in the simpler memoryless case. Under the same assumptions as in Theorems 1 and 2
(and perhaps under mild additional regularity conditions on the source as in [18]), we obtain the
following analogs.

Theorem 1-k. For any $k$ we have:

$$\limsup_{n \to \infty} \frac{1}{n} H(N_n | C_n) \le \frac{1}{k} I_m(P_k \| Q^*_{P_k,Q^k,kD}, kD) \quad \text{bits/symbol}.$$

Theorem 2-k. Let $\hat{Q}_n^{(k)}$ denote the $k$th order empirical distribution induced by the matching
codeword $Z_1^n = Y_1^n(N_n)$ on $\hat{A}^k$. With probability one, for any $k$ we have:

$$\hat{Q}_n^{(k)} \to Q^*_{P_k,Q^k,kD}, \quad \text{as } n \to \infty.$$

Following standard arguments used in the analysis of the rate-distortion function, we can define
the LMI rate as

$$I_m(\mathbb{P} \| Q^*_{\mathbb{P},Q^\infty,D}, D) = \inf_k \frac{1}{k} I_m(P_k \| Q^*_{P_k,Q^k,kD}, kD) = \lim_{k \to \infty} \frac{1}{k} I_m(P_k \| Q^*_{P_k,Q^k,kD}, kD).$$

It follows that the best upper bound on the index entropy is $I_m(\mathbb{P} \| Q^*_{\mathbb{P},Q^\infty,D}, D)$,
and we conjecture that this bound is in fact tight, i.e.,

$$\lim_{n \to \infty} \frac{1}{n} H(N_n | C_n) = I_m(\mathbb{P} \| Q^*_{\mathbb{P},Q^\infty,D}, D).$$

Example. Universal Gaussian Codebooks. Returning to the special case considered in the last
example, if $Q \sim N(0, \tau^2)$, then as the codebook variance $\tau^2 \to \infty$ the rate
$I_m(\mathbb{P} \| Q^*_{\mathbb{P},Q^\infty,D}, D)$ achieved by the entropy-coded scheme satisfies

$$\lim_{\tau^2 \to \infty} I_m(\mathbb{P} \| Q^*_{\mathbb{P},Q^\infty,D}, D) = I(X; X + Z_D), \tag{39}$$

where $Z_D$ is a white Gaussian process with variance $D$ and $I(X; X + Z_D)$ is the mutual information
rate between $X$ and $X + Z_D$. (A simple heuristic calculation indicating that (39) holds is to divide
(38) by $k$ and take $k$ to infinity.) Combining this with the fact that $I(X; X + Z_D) \le R(D) + 1/2$
[37, 32], where $R(D)$ is the rate-distortion function of the entire process (not just the first-order
rate-distortion function), we get that, as $\tau^2 \to \infty$, the rate achieved by the entropy-coded
scheme is no worse than $R(D) + 1/2$ bits/symbol. 
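
The half-a-bit gap can be checked by hand in the simplest setting. The sketch below assumes a memoryless
Gaussian source with variance `var` (an illustrative special case; the statement in the text covers
general processes) and uses the standard closed forms $I(X; X+Z_D) = \frac{1}{2}\log_2(1 + \sigma^2/D)$
and $R(D) = \max\{0, \frac{1}{2}\log_2(\sigma^2/D)\}$. The function names are hypothetical.

```python
import numpy as np

def gaussian_channel_mi_bits(var, D):
    """I(X; X + Z_D) in bits for X ~ N(0, var) and independent Z_D ~ N(0, D)."""
    return 0.5 * np.log2(1.0 + var / D)

def gaussian_rd_bits(var, D):
    """Rate-distortion function (bits) of a memoryless N(0, var) source, squared error."""
    return max(0.0, 0.5 * np.log2(var / D))

var = 1.0                                       # illustrative source variance (assumption)
distortions = [0.001, 0.01, 0.1, 0.5, 1.0, 2.0]
gaps = [gaussian_channel_mi_bits(var, D) - gaussian_rd_bits(var, D) for D in distortions]
for D, gap in zip(distortions, gaps):
    print(f"D = {D:6.3f}: I(X; X+Z_D) - R(D) = {gap:.4f} bits")
```

In this special case the gap equals $\frac{1}{2}\log_2(1 + D/\sigma^2)$ for $D \le \sigma^2$, so it vanishes
as $D \to 0$ and reaches exactly half a bit at $D = \sigma^2$.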
Appendix 

A Proof Outline of Proposition 1 

Although the result of the proposition can be obtained by little more than elementary calculus, the 
calculations are rather lengthy so we only give an outline of the proof here. 

First observe that, in the notation of Section 4.1 the log-moment generating function $\Lambda(\lambda)$
can be evaluated explicitly,

$$\Lambda(\lambda) = -\frac{1}{2}\ln(1 - 2\lambda\tau^2) + \frac{\lambda\sigma^2}{1 - 2\lambda\tau^2},$$

and (|2fl]> can be solved to show that the optimizing value of $\lambda = \lambda^*$ is given by

$$\lambda^* = \frac{2D - \tau^2 - \Delta}{4D\tau^2},$$

where

$$\Delta = \sqrt{\tau^4 + 4\sigma^2 D},$$

so that, as $\tau^2 \to \infty$ we have

$$\Delta = \tau^2 + O(\tau^{-2}), \tag{40}$$

$$\lambda^* = -\frac{1}{2D} + O(\tau^{-2}). \tag{41}$$


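From the proof of [8, Theorem 2] it follows that the joint distribution $W^*$ that achieves the infimum
in the definition of $R(P, Q, D)$ is given by

$$\frac{dW^*}{d(P \times Q)}(x,y) = \frac{e^{\lambda^*(x-y)^2}}{E_Q\big[e^{\lambda^*(x-Y)^2}\big]},$$

and, as discussed in Section I1~T1, $Q'$ is the $Y$-marginal $W^*_Y$ of $W^*$. Therefore, writing
$\phi_a(y)$ for the $N(0,a)$ density, the density $f_{Q'}(y)$ of $Q'$ with respect to Lebesgue measure
$m(dy)$ can be expressed as

$$f_{Q'}(y) = \frac{dQ'}{dm}(y) = E_P\left[\frac{e^{\lambda^*(X-y)^2}}{E_Q\big[e^{\lambda^*(X-Y)^2}\big]}\right] \phi_{\tau^2}(y),$$

where $X \sim P$ and $Y \sim Q$ are independent. Evaluating the denominator explicitly and rearranging
terms, the above expression becomes

$$f_{Q'}(y) = \sqrt{\frac{1 - 2\lambda^*\tau^2}{2\pi\tau^2}}\; E_P\left[\exp\left\{\lambda^*(X-y)^2 - \frac{\lambda^* X^2}{1 - 2\lambda^*\tau^2} - \frac{y^2}{2\tau^2}\right\}\right], \tag{42}$$

and recalling (40) and (41), we can let $\tau^2 \to \infty$ to get that

$$\frac{dW^*}{d(P \times Q)}(x,y)\,\phi_{\tau^2}(y) = \sqrt{\frac{1 - 2\lambda^*\tau^2}{2\pi\tau^2}}\; \exp\left\{\lambda^*(x-y)^2 - \frac{\lambda^* x^2}{1 - 2\lambda^*\tau^2} - \frac{y^2}{2\tau^2}\right\} \to \phi_D(y - x) \quad \text{as } \tau^2 \to \infty.$$

Invoking the dominated convergence theorem we can conclude that

$$f_{Q'}(y) \to E_P[\phi_D(y - X)] \quad \text{as } \tau^2 \to \infty,$$

as claimed. This proves (a). 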

For part (b) note that, from (3) and the above discussion it follows that

$$\begin{aligned}
I_m(P \| Q', D) &= R(P, Q, D) - H(Q' \| Q) \\
&= H(W^* \| P \times Q) - H(Q' \| Q) \\
&= H(W^* \| P \times Q') && (43) \\
&= \iint \left[ \frac{dW^*}{d(P \times Q)}(x,y)\, \frac{dQ}{dm}(y)\, \ln\left( \frac{\dfrac{dW^*}{d(P \times Q)}(x,y)\, \dfrac{dQ}{dm}(y)}{\dfrac{dQ'}{dm}(y)} \right) \right] dP(x)\, dm(y). && (44)
\end{aligned}$$

Using the expressions for the densities in (42) and (43), recalling that $dQ/dm(y) = \phi_{\tau^2}(y)$, and
applying the convergence bounds in (40) and (41), it is straightforward to show that the last
integrand in $[\cdots]$ above converges to

$$\phi_D(y - x)\, \ln\left( \frac{\phi_D(y - x)}{E_P[\phi_D(y - X)]} \right).$$

Writing $V_D$ for the joint distribution of the random variables $(X, X + Z_D)$ as in the statement of the
proposition, and $Q_D$ for the distribution of $(X + Z_D)$, the above expression can be rewritten as

$$\frac{dV_D}{d(P \times m)}(x,y)\, \ln\left[ \frac{dV_D}{d(P \times Q_D)}(x,y) \right].$$

Finally, using (40) and (41) to justify the use of the dominated convergence theorem we get that also
the integrals converge, i.e., as $\tau^2 \to \infty$,

$$\begin{aligned}
I_m(P \| Q', D) &= \iint \left[ \frac{dW^*}{d(P \times Q)}(x,y)\, \frac{dQ}{dm}(y)\, \ln\left( \frac{\dfrac{dW^*}{d(P \times Q)}(x,y)\, \dfrac{dQ}{dm}(y)}{\dfrac{dQ'}{dm}(y)} \right) \right] dP(x)\, dm(y) \\
&\to \iint \frac{dV_D}{d(P \times m)}(x,y)\, \ln\left\{ \frac{dV_D}{d(P \times Q_D)}(x,y) \right\} dP(x)\, dm(y) \\
&= H(V_D \| P \times Q_D) \\
&= I(X; X + Z_D),
\end{aligned}$$

proving (b). 

Finally part (c) follows from the well-known fact [37, 32, 35] that the rate-distortion function
$R(D)$ of a real-valued memoryless source (with respect to squared error distortion) is bounded below
by $I(X; X + Z_D) - 1/2$. □ 

B Proof of Lemma 4 



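First observe that

$$H(\ell_n) \le H(\ell_n, I_{A_n}) = H(\ell_n) + H(I_{A_n} | \ell_n) \le H(\ell_n) + 1,$$

so that

$$\lim_{n \to \infty} \frac{1}{n}\big[ H(\ell_n) - H(\ell_n, I_{A_n}) \big] = 0. \tag{45}$$

Also we can expand

$$\begin{aligned}
\frac{1}{n} H(\ell_n, I_{A_n}) &= \frac{1}{n} H(I_{A_n}) + \frac{1}{n} H(\ell_n | I_{A_n} = 1)\Pr\{A_n\} + \frac{1}{n} H(\ell_n | I_{A_n} = 0)(1 - \Pr\{A_n\}) \\
&= O\Big(\frac{1}{n}\Big) + \frac{1}{n} H(\ell_n | I_{A_n} = 1) + (1 - \Pr\{A_n\})\, \frac{1}{n}\big[ H(\ell_n | I_{A_n} = 0) - H(\ell_n | I_{A_n} = 1) \big] \\
&\le \frac{1}{n} H(\ell_n | I_{A_n} = 1) + O\Big(\frac{1}{n}\Big) + (1 - \Pr\{A_n\})\, \frac{1}{n} \log(2^{(\cdots)}),
\end{aligned}$$

i.e.,

$$\lim_{n \to \infty} \frac{1}{n}\big[ H(\ell_n, I_{A_n}) - H(\ell_n | I_{A_n} = 1) \big] = 0. \tag{46}$$

Combining (45) with (46) proves the lemma. □ 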
C Proof of Lemma 5 



It is well-known that the Geometric (non-truncated) distribution has the largest entropy among all
nonnegative variables with a given mean. Now, it is easy to verify that if $Z_q$ is a Geom($q$) random
variable [i.e., if $Z_q = k$ with probability $q(1-q)^{k-1}$ for $k = 1, 2, \ldots$], then $E[Z_q] = 1/q$, and

$$H(Z_q) = \log(1/q) - \frac{1-q}{q} \log(1-q) \le \log(e/q).$$

Thus, since the mean of a mixture of truncated Geometric distributions is smaller than or equal to
the mean of the Geometric distribution with the smallest parameter $q$ (in our case, $a$), we obtain that
$H(Z) \le H(Z_a) \le \log(e/a)$. □ 
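
The entropy formula for the Geometric distribution and the bound $\log(e/q)$ are easy to confirm
numerically. The following is a small sketch (function names and the test values of $q$ are illustrative
assumptions) that compares the closed form with a direct summation of $-\sum_k p_k \ln p_k$ over the
probabilities $p_k = q(1-q)^{k-1}$, working in nats.

```python
import math

def geom_entropy_closed(q):
    """Closed form log(1/q) - ((1-q)/q) * log(1-q), in nats, for 0 < q < 1."""
    return math.log(1.0 / q) - (1.0 - q) / q * math.log(1.0 - q)

def geom_entropy_sum(q, terms=20000):
    """Direct evaluation of -sum_k p_k log p_k with p_k = q (1-q)^(k-1), k = 1, 2, ..."""
    total = 0.0
    for k in range(1, terms + 1):
        p = q * (1.0 - q) ** (k - 1)
        if p == 0.0:          # remaining terms underflow to zero
            break
        total -= p * math.log(p)
    return total

qs = [0.05, 0.3, 0.9]         # illustrative parameters
closed = [geom_entropy_closed(q) for q in qs]
direct = [geom_entropy_sum(q) for q in qs]
bounds = [math.log(math.e / q) for q in qs]
for q, c, d, b in zip(qs, closed, direct, bounds):
    print(f"q = {q}: closed = {c:.6f}, summed = {d:.6f}, log(e/q) = {b:.6f}")
```

The bound is loosest for large $q$ and nearly tight for small $q$, which is the regime that matters when
the smallest parameter is close to zero.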



D Proof of Proposition 2 

We first introduce some convenient notation. Let $W^*_s$ denote the joint distribution minimizing
$H(W \| P \times Q_s)$ in (2) (i.e., achieving $R(P, Q_s, D)$), and let $Q^*_s$ denote its induced
$Y$-marginal. In our previous notation, $Q^*_s = Q^*_{P,Q_s,D}$ and $I(W^*_s) = I_m(P \| Q^*_{P,Q_s,D}, D)$,
where $I(W)$ is the mutual information associated with a joint distribution $W$. Let $W_{\rm add}$ denote
the joint input-output distribution associated with the additive noise channel $Y = X + Z_D$ in (14). In
this notation, part (ii) of the proposition amounts to

$$I_m(P \| Q^*_{P,Q_s,D}, D) = I(W^*_s) \to I(W_{\rm add})$$

as $s \to \infty$.

Clearly $W_{\rm add}$ is in the set of admissible distributions $W$ in (2). Moreover, by Lemma TIGHT we
know that $H(W_{\rm add} \| P \times Q_s) - R(P, Q_s, D) \to 0$ as $s \to \infty$. That is, $W_{\rm add}$
asymptotically achieves the minimum of $H(W \| P \times Q_s)$. Since, by © and the Pythagorean theorem
for divergence, for any admissible $W$

$$H(W \| P \times Q_s) \ge R(P, Q_s, D) + H(W^*_s \| W),$$

we conclude that

$$H(W^*_s \| W_{\rm add}) \to 0. \tag{47}$$

Since relative entropy dominates $L_1$ distance, this implies that the density of $W^*_s$ converges to
that of $W_{\rm add}$, a fortiori proving part (i).

Part (i) and the semi-continuity of the divergence [25] imply that

$$\liminf_{s \to \infty} I(W^*_s) \ge I(W_{\rm add}). \tag{48}$$

On the other hand, it follows from (47) and the chain rule for relative entropy, [8], that the
conditional relative entropy $H(W^*_s \| W_{\rm add} | P) \to 0$ as $s \to \infty$. Alternatively, if we
expand $H(W^*_s \| W_{\rm add} | P)$ in terms of differential entropy, this becomes

$$\lim_{s \to \infty} \big[ h(Z_D) - h(Y^*_s | X) \big] = 0$$

where $(X, Y^*_s)$ are jointly distributed as $W^*_s$, $Z_D$ is a maximum entropy random variable
independent of $X$, and $E\rho(Y^*_s - X) = D$ (equality here is due to the strict monotonicity of
$h_{\max}(D)$ as a function of $D$). Therefore,

$$h(Y^*_s | X) \to h(Z_D) = h_{\max}(D). \tag{49}$$

We can also conclude from (47) that the relative entropy between the outputs vanishes

$$\lim_{s \to \infty} H(Q^*_s \| Q_Y) = 0,$$

where $Q_Y$ denotes the distribution of $Y = X + Z_D$. Again by the semi-continuity of the divergence
this implies that

$$\limsup_{s \to \infty} h(Y^*_s) \le h(X + Z_D); \tag{50}$$

see (2^. Combining (49) and (50) we thus have

$$\begin{aligned}
I(W^*_s) &= I(X; Y^*_s) && (51) \\
&= h(Y^*_s) - h(Y^*_s | X) && (52) \\
&\le h(X + Z_D) - h(Z_D) + o(1) && (53) \\
&= I(X; X + Z_D) + o(1) && (54)
\end{aligned}$$

where $o(1) \to 0$ as $s \to \infty$. This, together with (48), proves part (ii).

Finally, part (iii) follows from [HE]. □ 